ABSTRACT

Some sequential, distribution-free pattern classification
techniques are presented. In many classification problems, the
observations on which the classification decision is to be based are
costly to measure. A sequential test seems appropriate since
observations are measured only until enough information is known to make
a decision with a certain level of confidence. Also in many cases, the
only information available about the pattern classes is a set of training
samples from each class. Since the underlying probability density
functions are unknown, distribution-free classification methods are
needed. The specific decision problem to which the proposed
classification methods are applied is that of discriminating between two
kinds of electroencephalogram (EEG) responses recorded from a human
subject: spontaneous EEG and EEG driven by a stroboscopic light
stimulus at the alpha frequency. Sequential, distribution-free methods
are suitable since it is generally desired to terminate the EEG recording
as quickly as possible and since there is no knowledge of probability
density functions underlying the EEG waveforms.

The classification procedures proposed make use of the theory of 
order statistics. Estimates of the probabilities of misclassification 
are given. One of the methods presented is an estimated version of 
the Wald sequential probability ratio test (SPRT). This method utilizes 
density function estimates, and in formulating this test, a new 
probability density function estimate is proposed. Convergence in 
probability of the estimate to the true density function is shown. The 
other method presented is a sequential version of the separating hyper- 
plane approach to pattern classification. 

The procedures were tested on Gaussian samples and on the EEG 
responses. Smaller error rates were easier to obtain with the 
estimated SPRT. In particular, error rates as low as .1% were obtained. 
With sequential tests, it is possible to specify the probability of error 
decisions before the test is conducted, and the experimental error rates 
of the procedures agree with the specified error probabilities.

CHAPTER I


INTRODUCTION 

1.1 Pattern Classification Problem

In the pattern classification problem, a pattern is given that
was drawn from one of several pattern classes, and a decision must
be made as to which class the pattern was drawn from. In order to
classify the pattern, a way must be found to characterize the pattern,
and then a method must be developed of processing the characterization
of the pattern to classify it. It is usual to attempt to characterize
the pattern as a set of s real numbers $x = (x^1, x^2, \ldots, x^s)$. The
components $x^i$ of the pattern vector are called features and are usually
measurements of various attributes of the pattern. The choice of
features to characterize the pattern is called the feature extraction
problem. While any number of pattern classes is possible, this report
will consider only classification problems with two pattern classes
$C^1$ and $C^2$. Once the observation x has been characterized as a vector,
the problem of classifying x can be formulated as finding a scalar
function g(x) such that x is classified as coming from $C^1$ if g(x) < 0
and as coming from $C^2$ if g(x) > 0.

In viewing the classification problem geometrically, each pattern
has been considered as a point in an s-dimensional space. Thus
g(x) = 0 is a separating surface that divides the sample space into two
regions corresponding to classifying the pattern x as coming from
$C^1$ or $C^2$.

In most meaningful classification problems, the two pattern
classes overlap to some extent and so are not separable in the
s-dimensional space. The objective in this case is to construct
a classification procedure that is optimal in some sense as regards
misclassifications.
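
The sign rule on g(x) described in this section can be sketched in a few
lines. This is only an illustration under an added assumption: g is taken
to be linear, g(x) = w . x + b, whereas the text requires only that g be a
scalar function. The names `g`, `classify`, `w`, and `b` are hypothetical.

```python
from typing import Sequence


def g(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Hypothetical linear discriminant g(x) = w . x + b (an assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Return 1 if g(x) < 0 (class C^1) and 2 if g(x) > 0 (class C^2)."""
    value = g(x, w, b)
    if value < 0:
        return 1
    if value > 0:
        return 2
    # The text leaves the boundary g(x) = 0 unassigned, so this sketch does too.
    raise ValueError("x lies on the separating surface g(x) = 0")
```

With w = (1, 1) and b = -1, the point (0, 0) falls on the $C^1$ side of the
surface and the point (2, 2) on the $C^2$ side.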

Since a pattern can be treated as a set of real numbers, the
two pattern classes will be characterized in this report by the
probability density functions $f(x|C^1)$ and $f(x|C^2)$. This does not
mean the density functions are always known but means that the patterns
from each class can be treated as random variables with a particular
probability density function. It may be that the density functions
reflect noise in measuring the features, or it may be that the
patterns themselves follow a particular density function.

Before proceeding to a more detailed discussion of pattern 
classification methods, an example of a classification problem will 
be given. 

1.2 Electroencephalograms 

The application of pattern classification techniques to the
bio-medical field has received increasing attention in recent years. One
specific area that has been studied is that of making decisions about
the state of a patient based on electroencephalograms (EEG). An
EEG is a recording of the electrical activity of the brain. From the
EEG waveform, some assessment can be made of the state of the patient;
for example, the level of consciousness of the patient can be determined
or some pathological conditions of the brain can be detected. The
electrical activity is measured by electrodes on the surface of the
scalp, and the EEG wave is generally considered to be a recording
of the gross activity of a large number of cells. An EEG response
can thus be considered to be a sample from a random process. The
pattern classification aspect of the problem now becomes apparent.

An EEG measured from a patient placed in a darkened, soundless 
room isolated from external stimuli is called a spontaneous EEG. If 
a light is flashed periodically into the patient’s eyes, the resulting 
EEG wave between two consecutive flashes is called an evoked response. 
This report will treat a classification problem to determine whether
given EEG responses are spontaneous or evoked.* As mentioned
previously, in order to classify an EEG wave, a set of features to
describe the wave must be extracted, and a decision rule to classify
the set of features must be formulated.

* Although the flashing of the light can be readily detected by
merely observing the light, this thesis attempts to make the
decision on the light by observing an EEG response from the
patient. The decision problem considered here is a first step
toward more meaningful problems such as determining the level of
unconsciousness of a patient during surgery. An unconscious patient
would react differently to a light stimulus than an awake patient.

1.3 Feature Extraction

Prabhu [1] has written a paper that discusses feature extraction
for the EEG classification problem. As recorded from the patient,
the EEG is a continuous waveform of the amplitude of the electrical
activity. The response between two consecutive flashes of the light
is considered to be one sample. To facilitate the use of a digital
computer, the amplitude was sampled in time at a set frequency so
that each sample EEG response was a vector. If the sampling rate is
high, the dimension of the sample vector may be quite large. Since
the complexity involved in finding a suitable decision rule increases
as the dimension of the sample increases, a subset of the features
may be selected to be used in the decision rule. Prabhu [1] has
developed a feature reduction scheme that picks a subset of the total
number of features. The features in the subset are selected
according to their effectiveness in some sense for classification
purposes. This feature reduction method is discussed in detail in
Appendix II.1 and in Prabhu [1].

1.4 Structure of the Classification Problem 

Now that a set of features has been extracted so that the EEG 
responses can be represented as vector samples, a decision process 
for classifying the EEG samples must be developed. The purpose of 
this report is to develop some classification techniques that are 
applicable to a class of problems represented by the EEG decision 
problem. Before discussing the specific properties of this class of 
problems, some general considerations of classification problems 
will be presented. 

In classifying an observation x, the two types of errors possible
are to decide $x \in C^2$ when actually $x \in C^1$, called error of type I,
and to decide $x \in C^1$ when actually $x \in C^2$, called error of type II.
Criteria for evaluating the effectiveness of decision rules are
usually expressed in terms of the probabilities of these errors
occurring. Let $\alpha = p(\text{error of type I})$ and
$\beta = p(\text{error of type II})$. Three examples of criteria expressed
in terms of $\alpha$ and $\beta$ follow.

1.) If the prior probabilities of an observation coming from
$C^1$ or $C^2$, $p(C^1)$ and $p(C^2)$ respectively, are known, then an expected
loss function associated with a misclassification can be expressed
as

$$E(\text{loss of misclassification}) = L_1 \alpha\, p(C^1) + L_2 \beta\, p(C^2)$$

where $L_1$ and $L_2$ are the costs of errors of type I and type II. A
possible criterion is to formulate a decision rule to minimize
the expected loss function. The Bayes test [2] satisfies this
criterion.

2.) Another possible criterion is to require that $\alpha$ be below
a specified value and then minimize $\beta$. This criterion is followed
by the Neyman-Pearson test [3].

3.) If the number of observations drawn before making a decision
is variable and not predetermined, a decision rule can be devised
where both $\alpha$ and $\beta$ are below specified values. The Wald sequential
probability ratio test [4] satisfies these conditions and minimizes
the expected number of observations needed for a decision.
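
As a small numerical illustration of the expected loss in criterion 1,
the following sketch evaluates the loss for given error probabilities,
priors, and costs. The function name and the example numbers are
hypothetical and are not taken from the report.

```python
def expected_loss(alpha: float, beta: float,
                  p_c1: float, p_c2: float,
                  loss_1: float, loss_2: float) -> float:
    """Expected loss L1 * alpha * p(C^1) + L2 * beta * p(C^2).

    alpha, beta    : probabilities of errors of type I and type II
    p_c1, p_c2     : prior probabilities of classes C^1 and C^2
    loss_1, loss_2 : costs of errors of type I and type II
    """
    return loss_1 * alpha * p_c1 + loss_2 * beta * p_c2


# Illustrative values only: equal priors, type II errors twice as costly.
example = expected_loss(alpha=0.1, beta=0.2, p_c1=0.5, p_c2=0.5,
                        loss_1=1.0, loss_2=2.0)
```

A decision rule that satisfies criterion 1 would be chosen to make this
quantity as small as possible over the admissible pairs of error
probabilities.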

Another factor that influences the choice of methods for solving
a classification problem is the type of information known about the
two pattern classes. When the probability density functions describing
the pattern classes are known, there are many well-known decision
tests that can be used, such as those already mentioned. In many
cases, however, the density functions are unknown, but sets of samples
drawn from each class are known. These sets of samples from each
class are called training sets. When training sets are the only
information available, pattern classification techniques must be
formulated from the training sets without using the density functions.

The development of a decision procedure then depends on two
factors:

1.) the information known about the two pattern classes, and

2.) the criterion.

The choice of a criterion is influenced by the information available,
e.g. if the density functions are unknown it is not possible to
minimize the actual probability of a misclassification but only
perhaps an estimate of it. The criterion also embodies the
characteristics that are important to a particular decision problem,
such as the number of observations that may be taken before a
classification decision is made.

1.5 The Approach Taken in this Report

In the classification problem of the EEG waves mentioned in Section
1.2, the underlying density functions of the EEG waves are unknown.
But it is generally possible to record a series of EEG responses from
the patient to use as training sets. Pattern classification procedures
that do not involve knowledge of the underlying density functions
are called distribution-free. The techniques proposed in this
report are distribution-free.

In making medical tests on a patient, the measurements are often
costly and discomforting to the patient. Thus it seems desirable
to terminate the measurements as quickly as possible, but at the same
time the final decision on the state of the patient must be made with
a certain level of confidence. A sequential test appears appropriate
for many bio-medical classification problems since observations are
taken one at a time only until enough information is known to make
a decision with a certain level of confidence. In sequential tests,
the p(error of type I) and p(error of type II) can both be specified
before the test. Sequential tests are suitable for the EEG decision
problem since the stroboscopic light can be flashed and responses
sampled on demand until enough data has been gathered to make a decision.

As mentioned in the previous paragraph, sequential methods take
observations one at a time until the string of observations provides
enough information in some sense to classify the observations. If the
observations are vectors, a whole new vector observation of the several
features is taken. After each observation is taken, three outcomes are
possible:

1.) decide the observations taken so far are from $C^1$

2.) decide the observations taken so far are from $C^2$

3.) decide to take another observation since not enough
information is known to make a decision.

Stated analytically, the classification problem using the sequential
method is to find a scalar function and two thresholds such that after
t observations have been taken

$$g(x_1, x_2, \ldots, x_t) \le B \quad \text{decide } C^1$$

$$B < g(x_1, x_2, \ldots, x_t) < A \quad \text{take another observation}$$

$$g(x_1, x_2, \ldots, x_t) \ge A \quad \text{decide } C^2$$

Since the two thresholds can be set independently, it is possible to
construct a sequential test where the p(error of type I) and p(error
of type II) are both specified to be certain values. As an example,
consider Figures 1.1 and 1.2. In the test using one observation
shown in Figure 1.1, two outcomes are possible, and a decision is
made according to which side of a single threshold the observation
lies. Since only one threshold is used, the probabilities of type I
and type II errors cannot be set independently. In the sequential
method of Figure 1.2, three outcomes are possible after each observation
is taken. The two thresholds that separate the three decision regions
can be set independently, and hence the probabilities of errors of
type I and type II can both be set to specified values. Since only
enough observations are taken to make a decision with the confidence
that the p(error of type I) and p(error of type II) have certain
values, the sequential method has the merit that test procedures can be
constructed which require, on the average, fewer observations than
equally reliable test procedures based on a predetermined number of
observations [4].
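
The three-outcome rule can be illustrated with the Wald sequential
probability ratio test in the idealized case where both densities are
known. The sketch below is not the distribution-free procedure of this
report. It assumes independent scalar observations, two Gaussian
densities with known means and a common known standard deviation, and
Wald's approximate thresholds A = (1 - beta)/alpha and
B = beta/(1 - alpha) on the likelihood ratio. All function names are
hypothetical.

```python
import math
from typing import Callable, Iterable, Tuple


def wald_thresholds(alpha: float, beta: float) -> Tuple[float, float]:
    """Wald's approximate thresholds (B, A) on the likelihood ratio."""
    return beta / (1.0 - alpha), (1.0 - beta) / alpha


def gaussian_log_ratio(mean_1: float, mean_2: float,
                       sigma: float) -> Callable[[float], float]:
    """log[f(x|C^2) / f(x|C^1)] for two Gaussians with a common sigma."""
    def log_ratio(x: float) -> float:
        return ((x - mean_1) ** 2 - (x - mean_2) ** 2) / (2.0 * sigma ** 2)
    return log_ratio


def sequential_test(observations: Iterable[float],
                    log_ratio: Callable[[float], float],
                    alpha: float, beta: float) -> Tuple[int, int]:
    """Return (decision, observations used).

    decision is 1 for C^1, 2 for C^2, and 0 if the supplied observations
    ran out before either threshold was crossed.
    """
    b, a = wald_thresholds(alpha, beta)
    log_b, log_a = math.log(b), math.log(a)
    total, used = 0.0, 0
    for x in observations:
        used += 1
        total += log_ratio(x)  # independence lets the log ratios add
        if total <= log_b:
            return 1, used
        if total >= log_a:
            return 2, used
    return 0, used
```

For means 0 and 1, sigma 1, and alpha = beta = 0.05, two observations
equal to 2.0 carry the summed log ratio past log A, so class $C^2$ is
decided after the second observation; a single observation at 0.5 is
uninformative and the test asks for more data.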
[Figure 1.2: Error Probabilities for Sequential Test. Decision regions
shown: decide $x \in C^1$, take another observation, decide $x \in C^2$.]
The classification procedures proposed in this report are
distribution-free and sequential. The methods are applicable in
classification problems where:

1.) the density functions of each class are unknown but training
sets are known, and

2.) a string of a variable number of observations, all from the
same unknown class, can be sampled on demand.

1.6 General Outline 

Two types of sequential, distribution-free procedures are presented
in the chapters that follow. In one, a series of thresholds are
calculated from training samples, and each observation that is taken
in the sequential sampling is compared to a different pair of thresholds
depending on the number of the iteration. In the other approach, the
same pair of thresholds is used throughout the sequential procedure,
and the scalar function of the observations that is compared to the
thresholds is altered at each iteration to include the information
contained in the new observation. Chapter II describes the former
approach, and Chapters III through VI are concerned with the latter.

Chapter II presents a brief review of the theory of order statistics
and then uses some results from order statistic theory to calculate a
set of thresholds for a sequential test. The thresholds are calculated
from the training sets in such a way that an estimate of the probability
of a misclassification is obtained. Multidimensional samples are
treated by transforming them into scalars with a linear transformation.
Experimental results are shown for the procedure tested on both Gaussian
and EEG data.

Chapters III through VI are concerned with the estimation of
probability density functions from training samples and the use of
density estimates in a sequential test called the sequential probability
ratio test (SPRT). The SPRT utilizes the ratio of the two density
functions representing the pattern classes. The ratio of the densities
is evaluated at the values of the observations and compared to two
thresholds. Since the density functions are unknown in the problems
considered in this thesis, estimates of the densities are used in the
SPRT. Chapter III discusses some approaches for estimating density
functions and surveys several known estimates. A new density estimate
is proposed in Chapter IV. The estimate is of a step-function form
where the boundaries of the steps are determined by the training
samples. The estimate is shown to converge in probability to the true
density as the number of training samples tends to infinity.

Chapter V begins with a discussion of the SPRT, and then formulates
an estimated version of the SPRT with the new density estimate. The
new density estimate was chosen because of its low computer storage
requirement and ease of calculation. Experimental results are shown
for independent Gaussian samples. Some techniques for handling
multidimensional samples and dependent observations are discussed in
Chapter VI. The methods involve taking a linear combination of the
features of multidimensional samples or taking the sum of several
dependent observations so that only scalar samples are considered. The
procedures are tested on EEG data.



CHAPTER II 

A SEQUENTIAL DISTRIBUTION-FREE PATTERN CLASSIFICATION
PROCEDURE USING ORDER STATISTICS

This chapter presents a sequential, distribution-free pattern
classification procedure that makes use of some results from order
statistics. The material in this chapter is self-contained, and
future chapters do not depend on what is developed here.

II.1 Introduction

The algorithm that follows assumes the type of prior information
and criterion listed in Section 1.5, namely that a training set from
each class is known and the test is to be sequential. One popular
method of solving the classification problem with training sets is to
place a hyperplane between the two sets of training samples that
separates the two classes of samples as much as possible. An observation
is classified according to which side of the hyperplane it lies on.
Generally such algorithms provide no direct estimate of the probability
of misclassification, and the decision is made based on examining only
one observation. Henrichon and Fu [5] have formulated an algorithm which
partitions the sample space into decision regions by training on sample
sets of known classification and uses order statistics to find an upper
bound on the misclassification probability. This chapter presents a
method which attempts to reduce the error in classifying observations
from inseparable classes by taking several observations before deciding
on classification. The observations are drawn sequentially. A
distribution-free estimate of the probability of misclassification is
presented. The remainder of the chapter describes the algorithm and
experimental results.
II.2 Assumptions

The method is designed to decide if an unknown observation
belongs to one of two classes, which shall be referred to as class 1
and class 2. The algorithm is trained on sample sets of known
classification and is distribution-free. The following assumptions
are made about the samples:

i. that a training set from each class is known
ii. that the samples are independently, identically distributed
in each class
iii. that the random variables from each class are of the continuous
type* (thus the probability of any two samples being equal is
zero)
iv. that several observations, all from the same unknown class
to be classified, can be taken since the method is to be
sequential.

* A random variable is of the continuous type if the distribution function
F(x) is everywhere continuous and the density function f(x) = F'(x)
exists and is continuous for all x, except possibly at certain points
of which any finite interval contains at most a finite number. Thus
$F(x) = p(\eta < x) = \int_{-\infty}^{x} f(t)\,dt$ [6]. A function F(x) which
has these properties is said to be absolutely continuous.

II.3 Order Statistics and Ordering Functions

Several properties of order statistics are used in this chapter.
A brief presentation of order statistics, including some
distribution-free properties, is given in this section without proof.
Appendix II.3 may be consulted for a more detailed discussion of order
statistics.
Let $X_1, X_2, \ldots, X_n$ be a set of n independent scalar random variables
from a continuous probability distribution function F(x). The samples
can be arranged in ascending order, $X_{i_1} < X_{i_2} < \cdots < X_{i_n}$. For
convenience, let the samples be relabeled, $Y_1 = X_{i_1}$, $Y_2 = X_{i_2}$,
..., $Y_n = X_{i_n}$, so that $Y_1 < Y_2 < \cdots < Y_n$. In the set
$(Y_1, Y_2, \ldots, Y_n)$, each member $Y_i$ is called an order statistic.
If X is a scalar random variable, F(X) is also a random variable. The
random variable F(X) turns out to have a uniform distribution on the
interval (0,1). Recall that the random variable F(X) can take on values
between 0 and 1, and $F(X) = p(\eta < X)$. So it is equally likely for any
random sample X that $p(\eta < X)$ be anywhere between 0 and 1. The
expectation of $F(Y_j) - F(Y_i)$ can be shown to be

$$E[F(Y_j) - F(Y_i)] = \frac{j - i}{n + 1}, \qquad j > i \qquad \text{(II.1)}$$

Thus

$$E[F(Y_{j+1}) - F(Y_j)] = \frac{1}{n + 1} \qquad \text{(II.2)}$$

It is observed that n random variables thus arranged in ascending
order partition the density function into n+1 parts. The expected
value of the probability of a sample falling between any two
neighboring order statistics is 1/(n+1). The variance of
$[F(Y_j) - F(Y_i)]$ can be shown to be

$$E\big[(F(Y_j) - F(Y_i)) - E(F(Y_j) - F(Y_i))\big]^2 = \frac{(j - i)(n - j + i + 1)}{(n + 1)^2 (n + 2)} \qquad \text{(II.3)}$$
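
The standard result that the expected coverage between the i-th and j-th
order statistics of n samples is (j - i)/(n + 1) can be checked
numerically. The sketch below is a Monte Carlo illustration only. Since
F(X) is uniform on (0,1) for any continuous F, uniform samples are drawn
directly; the function name, trial count, and seed are hypothetical
choices.

```python
import random


def mean_coverage(n: int, i: int, j: int,
                  trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[F(Y_j) - F(Y_i)] for 1 <= i < j <= n.

    Each trial draws n uniform values, which play the role of
    F(X_1), ..., F(X_n), sorts them, and records the gap between the
    j-th and i-th smallest values.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = sorted(rng.random() for _ in range(n))
        total += u[j - 1] - u[i - 1]  # order statistics are 1-indexed
    return total / trials
```

For n = 9, i = 2, and j = 5 the estimate should lie close to
(5 - 2)/(9 + 1) = 0.3, whatever continuous distribution the original
samples came from.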

For dealing with multidimensional samples, ordering functions
are used to transform the vector samples onto the real line. Let X
be a multidimensional random variable with a continuous distribution
function F(x). If W = g(X) is a random variable with a continuous
distribution function G(w), then g(x) is an ordering function.
Kemperman [7] has shown how the sample space can be partitioned using
a class of ordering functions so that the distribution of the
probability of a future observation falling in any partition can be
found. An example of using one linear ordering function for partitioning
the sample space is given in Figure II.1. For the random sample
$X_1, X_2, \ldots, X_n$ from the multivariate, absolutely continuous
distribution function F(x), if the transformed vectors are ordered,
$g(X_{i_1}) < g(X_{i_2}) < \cdots < g(X_{i_n})$, and relabeled,
$W_1 = g(X_{i_1})$, $W_2 = g(X_{i_2})$, ..., $W_n = g(X_{i_n})$, then

$$E[G(W_j) - G(W_k)] = \frac{j - k}{n + 1}, \qquad j > k \qquad \text{(II.4)}$$

$$= E\big[p\big(g(X_{i_k}) < g(X) < g(X_{i_j})\big)\big]$$

The expected probability of a future observation falling in the block
partitioned by $g(x_{i_j})$ and $g(x_{i_k})$ is $(j - k)/(n + 1)$, j > k. For
example, let $g(x^1, x^2, \ldots, x^s) = a_1 x^1 + a_2 x^2 + \cdots + a_s x^s$
be a linear function and let $X_1, X_2, \ldots, X_n$ be a set of n vector
samples. Then if the transformed samples are arranged so that
$g(x_{i_1}) < g(x_{i_2}) < \cdots < g(x_{i_n})$, then the expected probability
of a future observation falling in the segment between the planes
$g(x_{i_j})$ and $g(x_{i_{j+1}})$ is independent of the choice of g as well
as the underlying distribution for x. Ordering functions and order
statistics are discussed more fully in Appendix II.3.
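
The use of a linear ordering function can be sketched as follows: each
vector sample is mapped to a scalar, the scalars are sorted to give the
order statistics, and a new observation is located in one of the n + 1
blocks they define. The coefficient values and function names are
hypothetical, and the sketch assumes no ties among the transformed
samples.

```python
import bisect
from typing import Callable, List, Sequence


def linear_ordering(a: Sequence[float]) -> Callable[[Sequence[float]], float]:
    """Return g(x) = a_1 x^1 + ... + a_s x^s for fixed coefficients a."""
    def g(x: Sequence[float]) -> float:
        return sum(ai * xi for ai, xi in zip(a, x))
    return g


def order_statistics(samples: Sequence[Sequence[float]],
                     g: Callable[[Sequence[float]], float]) -> List[float]:
    """Transform the vector samples with g and sort in ascending order."""
    return sorted(g(x) for x in samples)


def block_index(w: List[float], x: Sequence[float],
                g: Callable[[Sequence[float]], float]) -> int:
    """Index (0..n) of the block that a new observation x falls into.

    Block 0 lies below W_1 and block n lies above W_n.
    """
    return bisect.bisect_left(w, g(x))
```

With three training vectors there are four blocks, and by the
order-statistic result each has expected probability 1/4 of receiving a
future observation from the same distribution.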
II.4 The Algorithm

II.4.1 Use of Two Thresholds

In dealing with multidimensional samples, this chapter uses the
same ordering function throughout for any one testing procedure. The
use of a single ordering function may not be optimal for many data sets,
but for some unimodal densities with one region of overlap the
shapes of the data sets are such that the use of a linear ordering
function sufficiently separates the two classes. Utilizing different
ordering functions for different iterations requires considerably
more computation and is discussed further in Section II.7. Of
course, for scalar samples the question of an ordering function
does not arise. For whatever ordering function is chosen, the
object of the algorithm is to decide to which class an unknown
observation belongs, so the ordering function chosen should separate
the two classes of training samples as much as possible.

A convenient type of ordering function to use is a linear
function. The distribution of the linearly transformed samples is
continuous. Figures II.2 and II.3 show two examples of linear ordering
functions. The two training sets in the figures cannot be separated by
a linear function. The function $g_2(x)$ of Figure II.3 separates to a
greater degree the two classes of training samples than the function
$g_1(x)$ of Figure II.2. For a decision algorithm, the ordering function
$g_2(x)$ is the better choice.

Many algorithms exist which yield a single linear separating
plane between the two classes of training samples. Ho and Agrawala [2]
give a survey of many linear separating algorithms. The equation
of such a separating hyperplane can be used as an ordering function
since it has good separating qualities.

[Figure II.2: Linear Ordering Function with Poor Separating Qualities.
Shows the linear function $g_1$ with class 1 and class 2 samples.]

[Figure II.3: Linear Ordering Function with Good Separating Qualities.
Shows the linear function $g_2$ with class 1 and class 2 samples.]

When a single ordering function is used on all training samples 
the expected probability of a new sample falling in the segment 
between any two planes, each placed through a training sample, 
is the same as the expected probability of the transformed sample 
falling between the transformed points of the order statistics. So 
hereafter, the sample points will be considered to have been
transformed and all samples will be considered to be real scalars. Also
all observations to be classified will be assumed to have been
transformed into scalars. The two classes are assumed to have one region
of overlap. For two inseparable classes of samples, the samples of class 2
are taken to lie largely above those of class 1. See Figure II.4 for
an example. A decision is made by comparing an unknown observation
with two thresholds which are placed in the overlap region.

If the unknown observation z lies above both thresholds, it is
assigned to one class; if z lies below both thresholds, it is assigned
to the other class; and if z lies between both thresholds, another
observation is taken as z lies in the region of overlap. The procedure
is applied to the new observation which is compared with a new set of
thresholds. It is assumed that all new observations come from the same
class. Figure II.5 provides an example of the algorithm showing how
the thresholds, labeled A and B, change for each iteration. New
observations are taken until a decision is made, and then the algorithm
is terminated.
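
The iteration described above can be sketched as a short loop. The sketch
assumes the thresholds for every iteration have already been computed and
are supplied as a list of pairs (B(k), A(k)); it also assumes, as in the
text, that class 2 lies largely above class 1. The function name and the
threshold values in the usage note are hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple


def classify_sequentially(
    observations: Iterable[float],
    thresholds: Sequence[Tuple[float, float]],
) -> Tuple[Optional[int], int]:
    """Return (class, observations used); class is None if undecided.

    thresholds[k-1] holds (B(k), A(k)) for iteration k. An observation
    above both thresholds is assigned to class 2, one below both to
    class 1, and one between them leads to another observation.
    """
    used = 0
    for z, (b_k, a_k) in zip(observations, thresholds):
        used += 1
        lower, upper = min(a_k, b_k), max(a_k, b_k)
        if z > upper:
            return 2, used
        if z < lower:
            return 1, used
    return None, used
```

For example, with thresholds (0.2, 0.8) on the first iteration and
(0.4, 0.6) on the second, the observations 0.5 and 0.9 give no decision
on the first iteration and class 2 on the second.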
[Figure II.4: Decision Regions for Sequential Test. Regions shown, from
low to high values: decide class 1, take another observation, decide
class 2.]

The new observation that is taken in each iteration of the algorithm
is compared with a new pair of thresholds that correspond to that
iteration.

II.4.2 Setting Thresholds for First Iteration

The thresholds are calculated by using some theory from order
statistics on the training sets of each class in such a way as to
give an estimate of the probability of a misclassification. The
n samples, now scalars, from each of the two training sets are
ordered separately in ascending magnitude. The ordering for one
class is

$$X_{i_1} < X_{i_2} < \cdots < X_{i_n}.$$

Let the training samples be relabeled for convenience

$$Y_1 = X_{i_1}, \quad Y_2 = X_{i_2}, \quad \ldots, \quad Y_n = X_{i_n}.$$

The training samples are now in ascending order,

$$Y_1 < Y_2 < \cdots < Y_n.$$

If z is an unknown observation, then

$$p(\text{classification error}) = p(\text{classification error} \mid z \in \text{class 1})\, p(z \in \text{class 1})
+ p(\text{classification error} \mid z \in \text{class 2})\, p(z \in \text{class 2}).$$

Thus the error probabilities for each class, p(classification error
| z ∈ class j), j = 1, 2, can be calculated separately. The setting of
thresholds will now be examined in detail for one class, say class 1,
and the ordered training set, $Y_1 < Y_2 < \cdots < Y_n$, will be considered
to be from that class. The following discussion of setting thresholds
applies to either class.

Given the set of order statistics from one class,

$$Y_1 < Y_2 < \cdots < Y_n,$$

the probability that an observation from this class is less than any
member of the order statistics, $Y_j$, is $F(Y_j)$. From equation (II.1)

$$E(F(Y_j)) = \frac{j}{n + 1}. \qquad \text{(II.6)}$$

An estimated 100j/(n+1) percent of all future observations lie below
$Y_j$ (or 100(n+1-j)/(n+1) percent exceed $Y_j$). Figure II.6 gives an
example with the two training sets together. The overlap region of
the inseparable training sets has been taken to be at the higher end
of the class 1 order statistics and lower end of the class 2 order
statistics.

In the following formulation of the thresholds, A(k) represents
the upper threshold and B(k) the lower threshold where k represents
the number of the iteration of the sequential test. A(k=1) will now
be determined in such a way that p(classification error | z ∈ class 1)
can be estimated. If the first unknown sample lies above both thresholds,
it will be classified as belonging to class 2, which would be an error.
If it lies below both thresholds a correct classification of class 1
would be made. If it falls between the thresholds, another observation
should be taken. See Figure II.7. To obtain an estimate of the
probability of an observation from class 1 falling above the upper
threshold, the number of training samples from class 1 that fall
above the threshold A(k=1) can be used.

[Figure II.6: Estimating Probabilities from Training Samples. Class 1 and
class 2 training samples on a common axis. Annotations: an estimated
100j/(n+1) percent of class 2 data points lie below the j-th smallest
value of the n-dimensional training set of class 2; an estimated
100j/(n+1) percent of class 1 data points lie above the j-th largest
value of the n-dimensional training set of class 1.]

[Figure II.7: Thresholds for Sequential Test.]

[Figure II.8: Estimating Thresholds for Sequential Test. Shows A(k=1) and
B(k=1) placed among the ordered training samples from class 1 and
class 2.]


Let the threshold $A(k=1)$ be set equal to the value of the
$n_{e_1}^1$-th largest order statistic of the training set of class 1,

$$A(k=1) = Y_{n_1 - n_{e_1}^1 + 1};$$

then $n_{e_1}^1 - 1$ training samples lie above $A(k=1)$. The superscript
on $n$ represents the number of iterations and the subscript the class. Thus

$$E[\,p(z_1 > A(k=1) \mid z_1 \in C^1)\,] = E[\,1 - F(Y_{n_1 - n_{e_1}^1 + 1})\,]$$

and from equation (II.1)

$$E[\,1 - F(Y_{n_1 - n_{e_1}^1 + 1})\,] = 1 - \frac{n_1 - n_{e_1}^1 + 1}{n_1 + 1} = \frac{n_{e_1}^1}{n_1 + 1}. \tag{II.7}$$


If $p$ is the desired probability of error for class 1 on this iteration,
then $n_{e_1}^1$ should be chosen so that

$$\frac{n_{e_1}^1}{n_1 + 1} = p$$

and so solving for $n_{e_1}^1$,

$$n_{e_1}^1 = (n_1 + 1)\,p. \tag{II.8}$$


When $n_{e_1}^1$ is not an integer, the greatest integer less than
$n_{e_1}^1$ is used; $[w]$ will represent the largest integer less than
or equal to $w$. $A(k=1)$ is then set equal to $Y_{n_1 + 1 - [n_{e_1}^1]}$.

$B(k=1)$, the error threshold for class 2, is determined similarly
from class 2 training samples. As the error region for class 2 lies
at the lower end of the ordered training samples, $B(k=1)$ is set equal
to the $n_{e_2}^1$-th lowest order statistic of class 2,

$$B(k=1) = Y_{n_{e_2}^1}. \tag{II.9}$$

$n_{e_2}^1$ is chosen such that $p$ is the desired error probability of an
observation from class 2 on the first iteration,

$$\frac{n_{e_2}^1}{n_2 + 1} = p, \qquad n_{e_2}^1 = (n_2 + 1)\,p. \tag{II.10}$$

$B(k=1)$ is set equal to $Y_{[n_{e_2}^1]}$. The setting of $A(k=1)$ and
$B(k=1)$ is illustrated in Figure II.8.

II.4.3 Thresholds for the Second and Following Iterations

If the observation on the first iteration falls between the
thresholds, a second observation is taken. Figure II.5 provides an
example. New thresholds are found for testing the second observation.
The probability of the first observation falling between the thresholds
can be estimated by counting the number of training samples between
the thresholds for each class. Again taking class 1, let $n_{b_1}^1$ be the
number of training samples between the thresholds on the first
iteration, see Figure II.8. Then an estimated fraction
$(n_{b_1}^1 + 1)/(n_1 + 1)$ of the area under the density function for
class 1 falls in the region between the thresholds.

Actually the lower threshold is based on class 2, so that there
is not one whole interval between class 1 sample points but a
fraction of one at the lower end of the region between the thresholds.
In practice, $n_1$ is usually large enough that counting the interval
as a whole has a negligible effect on $n_{e_1}^2$.

For a decision to be made resulting in a classification error on
the second iteration, the first observation must fall in the region
between the thresholds of the first iteration and the second observation
in the error region of the second iteration. If $p$ is the desired
probability of error for the second iteration, then we desire

p(1st observation between thresholds) p(2nd observation in error region) = p,

$$p(B(k=1) < z_1 < A(k=1))\; p(z_2 > A(k=2)) = p.$$

But $p(B(k=1) < z_1 < A(k=1))$ and $p(z_2 > A(k=2))$ are unknown, and
they can only be estimated. With $n_{b_1}^1$ the number of class 1 training
samples between the first-iteration thresholds, the number of training
samples in the error region for the second iteration is chosen so that

$$\frac{n_{b_1}^1 + 1}{n_1 + 1} \cdot \frac{n_{e_1}^2}{n_1 + 1} = p,$$

$$n_{e_1}^2 = \frac{n_1 + 1}{n_{b_1}^1 + 1}\,(n_1 + 1)\,p = \frac{n_1 + 1}{n_{b_1}^1 + 1}\; n_{e_1}^1 \tag{II.11}$$

from equation (II.8). $A(k=2)$ is set equal to the $[n_{e_1}^2]$-th largest
training sample, $A(k=2) = Y_{n_1 - [n_{e_1}^2] + 1}$. $B(k=2)$ is set similarly
using the training samples of class 2. It is desired that
$p(B(k=1) < z_1 < A(k=1))\; p(z_2 < B(k=2)) = p$, which can be estimated
by considering

$$\frac{n_{b_2}^1 + 1}{n_2 + 1} \cdot \frac{n_{e_2}^2}{n_2 + 1} = p$$

and solving for $n_{e_2}^2$,

$$n_{e_2}^2 = \frac{n_2 + 1}{n_{b_2}^1 + 1}\,(n_2 + 1)\,p = \frac{n_2 + 1}{n_{b_2}^1 + 1}\; n_{e_2}^1. \tag{II.12}$$

$B(k=2)$ is set equal to $Y_{[n_{e_2}^2]}$.

As $(n_1 + 1)/(n_{b_1}^1 + 1) > 1$, then by referring to equation (II.11) it is
seen that $n_{e_1}^2 > n_{e_1}^1$, which implies $A(k=2) \le A(k=1)$, and for similar
reasons $B(k=2) \ge B(k=1)$. Thus the thresholds for the second
iteration will be closer together than the thresholds for the first
iteration.

The number of training samples between the thresholds for
the second iteration are counted for each class, $n_{b_1}^2$ and $n_{b_2}^2$.
Then $n_{e_1}^3$ and $n_{e_2}^3$ can be calculated. For an error decision on
the third iteration, both the first and second observations must fall
between their respective thresholds, and the third observation must
fall in the error region.

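The calculation of the thresholds continues, with the thresholds
for each iteration being calculated simultaneously. Figure II.5
again gives an illustrative example. In general,

$$p(B(k=1) < z_1 < A(k=1)) \cdots p(B(k-1) < z_{k-1} < A(k-1))\; p(z_k > A(k)) = p.$$

The estimated form, with $n_{b_1}^j$ the number of class 1 training samples
between the thresholds of iteration $j$, is

$$\frac{n_{b_1}^1 + 1}{n_1 + 1} \cdot \frac{n_{b_1}^2 + 1}{n_1 + 1} \cdots \frac{n_{b_1}^{k-1} + 1}{n_1 + 1} \cdot \frac{n_{e_1}^k}{n_1 + 1} = p \tag{II.13}$$

and solving for $n_{e_1}^k$,

$$n_{e_1}^k = \frac{n_1 + 1}{n_{b_1}^1 + 1} \cdots \frac{n_1 + 1}{n_{b_1}^{k-1} + 1}\,(n_1 + 1)\,p \tag{II.14}$$

$$n_{e_1}^k = \frac{n_1 + 1}{n_{b_1}^{k-1} + 1}\; n_{e_1}^{k-1}. \tag{II.15}$$

Similarly,

$$n_{e_2}^k = \frac{n_2 + 1}{n_{b_2}^{k-1} + 1}\; n_{e_2}^{k-1}. \tag{II.16}$$

$A(k)$ is set equal to $Y_{n_1 - [n_{e_1}^k] + 1}$ and $B(k)$ equal to $Y_{[n_{e_2}^k]}$.
As

$$\frac{n_1 + 1}{n_{b_1}^{k-1} + 1} > 1 \quad \text{and} \quad \frac{n_2 + 1}{n_{b_2}^{k-1} + 1} > 1,$$

the bounds move closer together.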


Eventually, for some $k$, the thresholds will cross. This happens
when $n_{e_1}^k$ and $n_{e_2}^k$ become sufficiently large that $B(k) > A(k)$. The
algorithm will be terminated for this value of $k$, and the two thresholds
are replaced by a common threshold. Let this terminal value of $k$ be
called $N$. A decision will be made at $k = N$ if the algorithm proceeds
this far. In the examples to follow the common threshold was set by
averaging the thresholds for $k = N-1$,

$$A(N) = B(N) = [A(N-1) + B(N-1)]/2. \tag{II.17}$$

This could of course be set in other ways.

The algorithm as presented has taken the probability of error
and ending on each iteration to be the same for each class,

$$p(\text{error decision and end on k-th iteration} \mid \text{unknown} \in C^1)
  = p(\text{error decision and end on k-th iteration} \mid \text{unknown} \in C^2) = p.$$

These could be set equal to different values if so desired, although
then the prior probabilities of the class to which the unknown observation
belongs, $p(\text{unknown} \in C^1)$ and $p(\text{unknown} \in C^2)$, should be known in
order to calculate the estimated error decision probabilities.
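
The threshold construction of this section can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not the program used for the experiments: the name `build_thresholds`, the small tolerance added before taking the integer part, and the rule of stopping as soon as an error region would exhaust a training set are choices made here. It assumes scalar training samples with class 1 lying largely below class 2, carries the counts $n_{e_1}^k$ and $n_{e_2}^k$ as real numbers through the recursions (II.15) and (II.16), and takes the integer part only when an order statistic is picked.

```python
import math


def build_thresholds(class1, class2, p):
    """Return (upper, lower): the lists A(1..N) and B(1..N).

    class1, class2: scalar training samples (class 1 assumed mostly below class 2).
    p: desired probability of an error decision ending on any one iteration.
    """
    y1, y2 = sorted(class1), sorted(class2)      # order statistics of each class
    n1, n2 = len(y1), len(y2)
    ne1, ne2 = (n1 + 1) * p, (n2 + 1) * p        # equations (II.8) and (II.10)
    upper, lower = [], []
    while True:
        # Integer part [w]; the 1e-9 guards against round-off (an assumption).
        i1 = math.floor(ne1 + 1e-9)
        i2 = math.floor(ne2 + 1e-9)
        if i1 < 1 or i2 < 1:
            raise ValueError("p is too small for these training set sizes")
        crossed = i1 > n1 or i2 > n2             # error region exhausts a training set
        if not crossed:
            a = y1[n1 - i1]                      # Y_{n1+1-[ne1]}, counted from 1
            b = y2[i2 - 1]                       # Y_{[ne2]}, counted from 1
            crossed = a <= b
        if crossed:
            if not upper:
                raise ValueError("thresholds cross on the first iteration")
            common = (upper[-1] + lower[-1]) / 2  # equation (II.17)
            upper.append(common)
            lower.append(common)
            return upper, lower
        upper.append(a)
        lower.append(b)
        # Count training samples strictly between the thresholds.
        nb1 = sum(1 for y in y1 if b < y < a)
        nb2 = sum(1 for y in y2 if b < y < a)
        ne1 *= (n1 + 1) / (nb1 + 1)              # equation (II.15)
        ne2 *= (n2 + 1) / (nb2 + 1)              # equation (II.16)
```

For the artificial training sets 1, 2, ..., 99 (class 1) and 51, 52, ..., 149 (class 2) with p = .05, the sketch returns the upper thresholds 95, 88, 75 and the lower thresholds 55, 62, 75, so that N = 3 and the last pair is the common threshold of equation (II.17).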


II.5 Summary of Algorithm

The application of the algorithm can be divided into two parts,
the formulation of the thresholds and the use of the thresholds to
classify an unknown observation. This section briefly reviews the
steps involved in both parts. Figure II.5 can be referred to as an
example.

First the thresholds are set using training sets from the two
classes. An ordering function is chosen that separates to some
degree the two classes of training samples, and the training samples
are reduced to scalars using the ordering function. The training
sets of scalars from each class are ordered.


Class 1: $Y_1 \le Y_2 \le \cdots \le Y_{n_1}$        Class 2: $Y_1 \le Y_2 \le \cdots \le Y_{n_2}$

The parameter $p$ is chosen. The number of samples in the error
region for the first iteration is found,

$$n_{e_1}^1 = (n_1 + 1)\,p, \qquad n_{e_2}^1 = (n_2 + 1)\,p,$$

and the thresholds are set,

$$A(k=1) = Y_{n_1 + 1 - [n_{e_1}^1]} \ \text{(class 1)}, \qquad B(k=1) = Y_{[n_{e_2}^1]} \ \text{(class 2)}.$$

The number of training samples between $B(k=1)$ and $A(k=1)$ in each
class are counted, $n_{b_1}^1$ and $n_{b_2}^1$ respectively. Then for the second
iteration, $k = 2$,

$$n_{e_1}^2 = \frac{n_1 + 1}{n_{b_1}^1 + 1}\; n_{e_1}^1, \qquad n_{e_2}^2 = \frac{n_2 + 1}{n_{b_2}^1 + 1}\; n_{e_2}^1.$$

Then $n_{b_1}^2$ and $n_{b_2}^2$ are determined by counting the number of training
samples of class 1 and class 2 between $B(k=2)$ and $A(k=2)$. For any
iteration $k$,

$$n_{e_1}^k = \frac{n_1 + 1}{n_{b_1}^{k-1} + 1}\; n_{e_1}^{k-1}, \qquad n_{e_2}^k = \frac{n_2 + 1}{n_{b_2}^{k-1} + 1}\; n_{e_2}^{k-1},$$

$$A(k) = Y_{n_1 + 1 - [n_{e_1}^k]} \ \text{(class 1)}, \qquad B(k) = Y_{[n_{e_2}^k]} \ \text{(class 2)}.$$

Determine $n_{b_1}^k$ and $n_{b_2}^k$, the numbers of training samples of class 1
and class 2 between $B(k)$ and $A(k)$, by counting. Whenever $A(k) \le B(k)$,
call $k = N$ and set one common threshold $A(N) = B(N)$.

In applying the algorithm to classify unknown observations,
each observation is first reduced to a scalar by using the ordering
function. The first observation is compared to the thresholds
$A(k=1)$ and $B(k=1)$. If

     $z_1 < B(k=1)$              decide class 1
     $z_1 > A(k=1)$              decide class 2
     $B(k=1) < z_1 < A(k=1)$     take another observation

If another observation is taken, $z_2$, then it is similarly compared
to $A(k=2)$ and $B(k=2)$. At each iteration that is needed, the bounds
for that iteration are used. For any iteration $k$,

     $z_k < B(k)$                decide class 1
     $z_k > A(k)$                decide class 2
     $B(k) < z_k < A(k)$         take another observation

If the procedure goes until $k = N$, a decision will be made then as
there is only one threshold.
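
The decision rule just listed can be written as a short loop. The following Python sketch is illustrative only; the function name `classify`, the return convention, and the handling of an observation that falls exactly on the common threshold (assigned to class 1 here) are assumptions, since the text does not specify them. The threshold lists are taken as given, one pair per iteration, with the last pair equal.

```python
def classify(observations, upper, lower):
    """Sequentially classify a string of scalar observations.

    upper[k-1], lower[k-1] hold A(k) and B(k); the last pair is the common threshold.
    Returns (decided_class, observations_used); decided_class is None if the
    observations run out before a decision is reached.
    """
    n_max = len(upper)
    used = 0
    for z, a, b in zip(observations, upper, lower):
        used += 1
        if z < b:
            return 1, used          # below the lower threshold: decide class 1
        if z > a:
            return 2, used          # above the upper threshold: decide class 2
        if used == n_max:
            return 1, used          # exactly on the common threshold (tie rule assumed)
    return None, used               # still undecided: more observations are needed
```

With the upper thresholds 95, 88, 75 and lower thresholds 55, 62, 75, the observation string 60, 70, 80 stays between the thresholds twice and is assigned to class 2 on the third observation.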

II.6 Estimated Probability of Misclassification

The probability of misclassification for the algorithm will
now be considered. The algorithm can end on only one iteration,
so the events of ending with an error decision on the k-th iteration
and of ending with an error decision on the j-th iteration are
mutually exclusive for $k \ne j$. The probability of error can be
expressed as

$$p(\text{error decision}) = \sum_{k=1}^{N} p(\text{error decision and end on k-th iteration}) \tag{II.18}$$

where

$$\begin{aligned}
p(\text{error decision and end on k-th iteration})
 &= p(\text{error decision and end on k-th iteration} \mid \text{unknown} \in C^1)\; p(\text{unknown} \in C^1) \\
 &\quad + p(\text{error decision and end on k-th iteration} \mid \text{unknown} \in C^2)\; p(\text{unknown} \in C^2).
\end{aligned} \tag{II.19}$$

Consider first the case where the unknown observations
are from class 1. Then

$$\begin{aligned}
p(\text{error decision and end on k-th iteration} \mid \text{unknown} \in C^1)
 &= p(B(1) < z_1 < A(1))\; p(B(2) < z_2 < A(2)) \cdots \\
 &\quad\; p(B(k-1) < z_{k-1} < A(k-1))\; p(z_k > A(k)).
\end{aligned} \tag{II.20}$$

All the thresholds are calculated from the training samples, and so

$$p(B(1) < z_1 < A(1)),\; p(B(2) < z_2 < A(2)),\; \ldots,\; p(B(k-1) < z_{k-1} < A(k-1)),\; p(z_k > A(k))$$

are random variables. Also, since the thresholds were calculated from
the same training samples, these random variables are dependent, and
the expectation of the left hand side of equation (II.20) is not
equal to the product of the expectations of the terms on the right
hand side. As it is not readily apparent how the true expectation
can be found, the expectation is approximated by

$$\begin{aligned}
\hat{p}(\text{error decision and end on k-th iteration} \mid \text{unknown} \in C^1)
 &= E\,p(B(1) < z_1 < A(1))\; E\,p(B(2) < z_2 < A(2)) \cdots \\
 &\quad\; E\,p(B(k-1) < z_{k-1} < A(k-1))\; E\,p(z_k > A(k)).
\end{aligned} \tag{II.21}$$

The symbol $\hat{p}$ is used to denote that the term is an approximation
of the expected value.

By the construction of the algorithm,

$$\hat{p}(\text{error decision and end on k-th iteration} \mid \text{unknown} \in C^1) = p. \tag{II.22}$$

A similar procedure can be used to show

$$\hat{p}(\text{error decision and end on k-th iteration} \mid \text{unknown} \in C^2) = p. \tag{II.23}$$

Thus from equation (II.19),

$$\hat{p}(\text{error decision and end on k-th iteration}) = p \cdot p(\text{unknown} \in C^1) + p \cdot p(\text{unknown} \in C^2) = p,$$

and so

$$\hat{p}(\text{error decision}) = \sum_{k=1}^{N} p = Np. \tag{II.24}$$


As mentioned previously, $Np$ is not the true expected probability
of error since the product of expectations of dependent random
variables was taken. Loosely speaking, if there are one hundred
training samples, the addition of another sample provides more
information to revise the estimate of the probability of error
than if there are one thousand samples. Thus as the number of
training samples approaches infinity, the knowledge of
$p(B(k) < z_k < A(k))$, $k = 1, 2, \ldots, N$, becomes precise and the bias in
the estimates of the probability of error would be expected to
tend to zero. Also in the next section, a comparison is made of
experimental results of the algorithm trained on one set of training
samples with results of using a different set of training samples
to calculate the pair of thresholds at each iteration. The use of
a different set of training samples to calculate the pair of
thresholds at each iteration makes the terms $p(B(k) < z_k < A(k))$,
$k = 1, 2, \ldots, N$, independent, so $Np$ is actually the expected probability
of error. In most practical problems, however, using a different
set of training samples at each iteration would require an excessive
number of training samples. The experimental comparison showed
there was little effect on the experimental results of using the
same set of training samples. A slight approximation was also
introduced when the value calculated for the number of a training sample
was not an integer and the largest integer less than the value was
used. These approximations seem unavoidable when the number of
training samples is finite.

Since $p$ is a specified parameter, it can be shown that $Np \le 1$ by
showing that $N$, the maximum number of iterations, has an upper
bound of $1/p$. $N$ will have its largest value when the probabilities
of an observation falling between the thresholds and not being
classified at each iteration have their largest values. Consider
first the probability of an error decision given the string of
observations is from class 1. At each iteration, $E\,p(z_k > A(k))$
is determined before $E\,p(B(k) < z_k < A(k))$ is determined, and thus
the upper bound on $E\,p(B(k) < z_k < A(k))$ is $1 - E\,p(z_k > A(k))$. For
convenience, let $p_{ek} = E\,p(z_k > A(k))$, and so $1 - p_{ek}$ is the upper
bound on $E\,p(B(k) < z_k < A(k))$. Using these upper bounds, the
thresholds at each iteration are found by setting

$$(1 - p_{e1})(1 - p_{e2}) \cdots (1 - p_{e(k-1)})\; p_{ek} = p$$

as is done in equations (II.13) and (II.14). For $k = 1$, the
thresholds are set such that $p_{e1} = p$, and by induction, it can be
shown that $p_{ek} = p/[1 - (k-1)p]$ when the above equation is used to
determine the thresholds. The thresholds are determined so that
the fraction of training samples exceeding $A(k)$ is equal to $p_{ek}$.
Since the fraction cannot exceed one, the procedure for generating
the thresholds at each iteration will stop before $p_{ek}$ exceeds 1.
Thus $p_{ek} = p/[1 - (k-1)p] \le 1$, which implies $k \le 1/p$. The analysis
is similar when the string of observations is assumed to be from
class 2, and the same upper bound on $k$ is found. Thus $N \le 1/p$
and $\hat{p}(\text{error decision}) \le 1$.
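
The bound $N \le 1/p$ argued above can be checked numerically. The short Python sketch below is offered only as an illustration; the function name and the tolerance are assumptions. It generates the worst-case error fractions from the product condition $(1 - p_{e1}) \cdots (1 - p_{e(k-1)})\,p_{ek} = p$ and stops when a fraction would exceed one, so that the number of fractions returned is the largest possible number of iterations.

```python
def error_region_fractions(p, tol=1e-9):
    """Worst-case fractions p_e1, p_e2, ... satisfying the product condition."""
    fractions = []
    between = 1.0                     # product of (1 - p_ej) over earlier iterations
    while between > tol:
        p_ek = p / between            # solve the product condition for p_ek
        if p_ek > 1.0 + tol:          # a fraction of samples cannot exceed one
            break
        fractions.append(p_ek)
        between *= 1.0 - p_ek         # largest possible mass left between thresholds
    return fractions
```

For p = .25 the fractions are 1/4, 1/3, 1/2 and 1, in agreement with $p_{ek} = p/[1 - (k-1)p]$, and the procedure stops after four iterations, which equals $1/p$.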

If $Np$ is not near the desired value, $p$ can be varied, which will
change $N$ and hence $Np$. $N$ is dependent on the value of $p$ chosen, and
generally for smaller $p$, $N$ becomes larger. $N$ is also dependent on
the two sets of training samples. If the training sets have a large
overlap, $N$ will be large. This is to be expected, as the region of
indecision is large so more iterations will result.

The probability of making an error decision on the N-th or
last iteration is actually not equal to $p$, as the two thresholds
were combined into one instead of allowing them to cross. The
actual error estimate can be made by counting the number of
training samples for class 1 and class 2 which would result in
an error decision on the N-th iteration. Let $m_{e_1}^N$ be the number
of training samples of class 1 above $A(N) = B(N)$ and $m_{e_2}^N$ be the
number of training samples of class 2 below. Then, with $n_{b_1}^j$ the
number of class 1 training samples between the thresholds of iteration $j$,

$$\hat{p}(\text{error decision and end on N-th iteration} \mid \text{unknown} \in \text{class 1})
 = \frac{n_{b_1}^1 + 1}{n_1 + 1} \cdots \frac{n_{b_1}^{N-1} + 1}{n_1 + 1} \cdot \frac{m_{e_1}^N}{n_1 + 1}
 = \frac{m_{e_1}^N}{n_{e_1}^N}\; p$$

from equation (II.14), where $n_{e_1}^N$ is defined by equation (II.14).
A similar equation applies to class 2. The total estimated probability
of error is

$$\hat{p}(\text{error decision}) = (N-1)\,p + \frac{m_{e_1}^N}{n_{e_1}^N}\; p \cdot p(\text{unknown} \in \text{class 1}) + \frac{m_{e_2}^N}{n_{e_2}^N}\; p \cdot p(\text{unknown} \in \text{class 2}).$$

As $p$ is small, $Np$ gives an adequate expression for $\hat{p}(\text{error decision})$
for most values of $N$ and $p$.
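
As a small worked illustration of the corrected estimate, the expression above can be evaluated directly. In the Python sketch below the function name, the argument names and the default equal prior probabilities are assumptions made for the example; the counts would come from the training samples as described in the text.

```python
def estimated_error(p, n_iter, m_e1, n_e1, m_e2, n_e2, prior1=0.5, prior2=0.5):
    """Estimated probability of error with the last-iteration correction.

    n_iter is N; m_e1, m_e2 are the counted training samples on the wrong side of
    the common threshold; n_e1, n_e2 are the values given by equation (II.14) for k = N.
    """
    last = (m_e1 / n_e1) * p * prior1 + (m_e2 / n_e2) * p * prior2
    return (n_iter - 1) * p + last
```

When the counted samples equal the values from equation (II.14), the estimate reduces to $Np$; for example, with p = .01 and N = 12 it gives .12, while halving both counts gives .115.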

An intuitive explanation for the closing together of the thresholds
can be given. In order for the algorithm to proceed to the second
iteration, the first observation must fall between the first two
thresholds. For a decision to be made resulting in an error on the
second iteration, the second observation must fall in the error
region. Let $p$ be the desired probability of making a decision which
ends in an error at each iteration. To obtain $p$ on the first iteration,
the probability of falling in the error region should be $p$. For an
error decision to be made on the second iteration, the first
observation must fall between the thresholds and the second observation
in the error region. The probability of this is

$$p(B(k=1) < z_1 < A(k=1)) \cdot p(z_2 \in \text{error region for } k=2) = p.$$

As $p(B(k=1) < z_1 < A(k=1)) < 1$, $p(z_2 \in \text{error region for } k=2)$ is
greater than $p(z_1 \in \text{error region for } k=1)$, and thus the error
decision region for $k = 2$ can afford to be larger than for $k = 1$,
leading to a smaller overlap region. The same argument applies for
larger $k$.

The setting of the estimated $p(\text{error decision on iteration } k)$
equal to $p$ for each iteration was done so that an estimate of the
probability of error for the algorithm could be obtained. This also
resulted in a finite number of iterations for the algorithm. The
probability of error is estimated looking from the beginning of
the test before any samples are taken.

II.7 Remarks

In treating multidimensional samples in the experimental results
of the next section, the same linear ordering function was used in
determining the thresholds for all the iterations. Using the same
linear ordering function throughout the algorithm may be suitable
when the data comes from unimodal densities which have one region
of overlap between the two classes. For some sample densities,
another type of ordering function might be preferable. The most
desirable procedure would be not only to locate a plane for each
threshold, but to determine the orientation of the plane in order
to optimize the procedure. At each iteration, all coefficients
$\{\alpha_i\}$ in the linear ordering function
$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_s x_s = \alpha_0$
would be determined instead of finding only $\alpha_0$. For example, the
number of training samples to be placed in the error region for each
threshold, $n_{e_1}^k$ and $n_{e_2}^k$, could be found as explained previously. For
each iteration, a plane would be placed through a training sample
of class 1 so that $n_{e_1}^k$ samples from class 1 lay on the error decision
side of the plane, and the plane oriented so that the number of training
samples of class 2 on the same side was maximized. Such a technique
would set $p(\text{error decision on k-th iteration} \mid \text{class 1}) = p$ and maximize
the probability of a correct classification for a class 2 observation.
A similar procedure would be applied using class 2 training samples
to the other plane and the class 2 error region. This method would
give $p(\text{error decision on k-th iteration} \mid \text{class } i) = p$, $i = 1, 2$, and
would also minimize the number of iterations. But this technique
requires a considerable amount of computation. Such a procedure
might have to be repeated several times to find the best training
sample through which to place the plane, and then the computation
must be done for the planes at each iteration. The choice of an
ordering function for multidimensional sample pattern classification
is an area in which further work can be done. Of course, for scalar
samples the question of choice of an ordering function does not
occur. For the examples in the next section, a single linear ordering
function was thought to be sufficient considering the extra amount
of computation required to orient a different plane at each iteration.

II.8 Experimental Results

The algorithm was tested on Gaussian random variables and on
electroencephalogram (EEG) signals. The results for scalar Gaussian
samples are given in Table II.1. Several training set sizes and
several values of the parameter p are given. The algorithm for
each set of parameters was tested on one thousand observations from
each of the two classes.

TABLE II.1   Gaussian Experimental Error Rates
(Class 1 mean = -.8, Class 2 mean = .8; variance of both classes = 1)

          Parameters                                Experimental Results
  ----------------------------------   -----------------------------------------------------------
          Number of     N = maximum    Average number of       Estimated   Class 1       Class 2
          training      number of      experimental            error       experimental  experimental
          samples for   iterations     iterations for          rate = Np   error rate    error rate
    p     each class    for decision   decision
                                       Class 1     Class 2
  ------  -----------   ------------   ---------   ---------   ---------   -----------   -----------
   .01        99            12           4.74        4.54        .12         .0474         .0666
   .01       199             9           4.03        3.95        .09         .0444         .0712
   .01       399             9           3.93        3.70        .09         .0630         .0741
   .01       999             7           3.3         2.90        .07         .11           .058
   .005      199            13           4.95        4.74        .065        .0346         .0711
   .005      399            13           5.02        5.12        .065        .0352         .0718
   .005      999            10           4.33        3.89        .05         .065          .055

The algorithm was also tested on EEG signals, which are discussed
in Section I.2 and in Appendix II.2. The EEG signals are from a
subject with a strobe light flashing in his eye or from the subject
with the light off. It is desired to decide on the basis of EEG
signals if the light is flashing or not. The signals with the light
off will be called class 1 and with the light on class 2. The EEG
responses were continuous signals of 100 millisecond duration, and
the responses were sampled every millisecond to obtain a one hundred
dimensional vector for each sample. A feature reduction scheme of
Prabhu [1, 8], which is explained in Appendix II.1, was used to
select a smaller number of features from the 100-dimensional
vector to make the testing procedure more manageable. For most
tests two features were used. Of the one hundred features, features
eighty-five and fifty-seven were selected as containing the most
significant information. A linear ordering function was used,

$$y = \alpha_{57}\,x_{57} + \alpha_{85}\,x_{85}.$$

The algorithm was trained on one section of EEG data from the subject
and tested on another section from the same subject. Table II.2
gives error rates on the testing samples for several values of the
parameter p. The samples were taken serially as they appeared from the
patient. Five hundred testing observations were used in all cases.

An examination of the EEG responses showed that the samples are
correlated and nonstationary. The independence assumption of the
algorithm is violated. The nonstationarity means that the samples
are not identically distributed. The correlation of the samples
along with the nonstationarity contributed to the higher than estimated
error rates in Table II.2.

To test the algorithm on data which was independent and uncorrelated,
one thousand serial samples of EEG waveforms were mixed together so
they no longer appeared serially as they were recorded from the patient.
The results for the mixed samples appear in Table II.3. The experimental
error rates in this case agree more closely with the estimated error
rates. This indicated that not all the assumptions of the algorithm are
met by the EEG waveforms as they are recorded from the patient.

TABLE II.2   Error Rates
(The parameter columns of this table are not legible in the source; the
rows are listed in their original order.)

   Average number of          Estimated    Class 1         Class 2
   experimental iterations    error        (no strobe)     (strobe on)
   for decision               rate = Np    experimental    experimental
   Class 1      Class 2                    error rate      error rate
   ---------    ---------     ---------    ------------    ------------
     1.9          1.8           .06          .209            .0757
     2.22         2.55          .07          .186            .0612
     3.42         3.05          .08          .199            .0548
     3.57         4.13          .09          .128            .066
     3.68         6.25          .06          .11             .0875
     3.91         4.38          .055         .132            .0789
     4.8          6.10          .07          .107            .013
    13.9         20.8           .04          .0556           .0833

   Table footnote (partly legible): "... of two were used for these
   experiments."

TABLE II.3   (mixed EEG samples)

          Parameters                                Experimental Results
  ----------------------------------   -----------------------------------------------------------
          Number of     N = maximum    Average number of       Estimated   Class 1       Class 2
          training      number of      experimental            error       (no strobe)   (strobe on)
          samples for   iterations     iterations for          rate = Np   experimental  experimental
    p     each class    for decision   decision                            error rate    error rate
                                       Class 1     Class 2
  ------  -----------   ------------   ---------   ---------   ---------   -----------   -----------
   .01        99            10           4.81        5.15        .1          .0962         .103
   .01       199            10           4.63        5.88        .1          .0925         .1295
   .01       399            10           3.68        4.58        .1          .054          .11
   .005      199            16           5.95        8.33        .08         .0357         .12
   .005      399            15           5.05        7.58        .075        .0303         .166

Section II.6 mentioned that the estimated probability of error
Np is biased since all the thresholds are calculated from the same
training samples. Table II.4 shows a comparison of experimental
results of the algorithm trained on one set of training samples
with the results of using a different set of training samples to
calculate the pair of thresholds at each iteration. The examples
are Gaussian as in Table II.1, and p = .01 was used for all
the results.

TABLE II.4   Comparison of Error Rates for One Training Set
             vs Several Training Sets

                             Number of          Estimated    Class 1         Class 2
                             training samples   error        experimental    experimental
                             in each class      rate = Np    error rate      error rate
   -----------------------   ----------------   ---------    ------------    ------------
   One training set                99             .12          .0474           .0666
   Different training sets         99             .09          .0468           .0675
   One training set               199             .09          .0444           .0712
   Different training sets        199             .08          .0947           .0655

The table indicates that using a different set of training samples for
calculating the pair of thresholds at each iteration does not give
significantly different experimental results than using one set of
training samples. The difference between the two estimated error rates
decreased as the number of training samples increased.

II.9 Conclusion to Chapter II

The algorithm presented in this chapter is a sequential
approach to pattern classification for the case where the underlying
probability densities of each class are unknown but training
sets are available. When a linear ordering function is used, the
algorithm can be viewed as a sequential variation of the linear
separating plane approach to pattern classification. The algorithm
used a different pair of thresholds at each iteration of the
sequential test. The thresholds are calculated before the test
and are independent of the observations taken during the sequential
decision procedure. The method does require some prior assumptions
on the pattern classes. The classes should have one region of
overlap such that when the multidimensional samples of the two
classes are transformed to scalars, the new scalar samples of one
class lie largely below the scalar samples of the other class.
For example, if one class of samples is surrounded by samples of the
other, the classes cannot be separated by a linear transformation.
A nonlinear transformation would have to be found.

The algorithm presented in this chapter used a different pair of
thresholds at each iteration of the sequential test; the next few
chapters present a sequential test where the same pair of thresholds
is used throughout the test.

Appendix II.1 - Feature Reduction and Separating Hyperplanes


The feature reduction scheme used in this report for selecting
significant features out of a vector random sample of many features
was developed by Prabhu [1], [8]. A measure of effectiveness of
any particular feature for classification purposes is

$$\frac{(\mu_i^1 - \mu_i^2)^2}{\sigma_{ii}^1 + \sigma_{ii}^2} \tag{II.1.1}$$

where $\mu_i^j$ and $\sigma_{ii}^j$ are the mean and variance of the i-th feature of
class j. The criterion picks the feature that tends to maximize
the distance between the means of the two classes while minimizing
the dispersion about the means. Considering the combined effectiveness
of a group of features, the correlation between the features is taken
into account, and the criterion generalizes to

$$d = (\mu^1 - \mu^2)^T\,(\Sigma^1 + \Sigma^2)^{-1}\,(\mu^1 - \mu^2) \tag{II.1.2}$$

where $\mu^j$ and $\Sigma^j$ are the mean vector and covariance matrix of the
features under consideration from class j. Since the means and
covariances of the two classes are unknown for the examples considered
in this thesis, the means and covariances are estimated from training
sets of the two classes.

Let $d_m$ be the value of the criterion in equation (II.1.2) when
m features are considered. The algorithm for selecting features from a
vector of s features, $x = (x_1, x_2, \ldots, x_s)$, is:

1.) Select the first feature $x_i$ such that

$$\frac{(\mu_i^1 - \mu_i^2)^2}{\sigma_{ii}^1 + \sigma_{ii}^2} = \max_j \frac{(\mu_j^1 - \mu_j^2)^2}{\sigma_{jj}^1 + \sigma_{jj}^2}$$

and so

$$d_1 = \frac{(\mu_i^1 - \mu_i^2)^2}{\sigma_{ii}^1 + \sigma_{ii}^2}.$$

2.) At each subsequent step, after m features have been chosen
and $d_m$ calculated, the increase in the criterion $(d_{m+1} - d_m)$ is computed
for each of the remaining features. The feature that gives rise to
the maximum increase is chosen.

Thus the algorithm at each step selects the feature that adds the
most to the effectiveness of the feature set already chosen, where
the effectiveness is measured by equation (II.1.2). The feature
selection procedure is not truly optimal in that the subset of the
best m features is not necessarily a subset of the best m+1 features.
To be truly optimal, the algorithm must search over all possible
combinations of m features at each step. But such an exhaustive search
becomes quickly infeasible as the total number of features increases.
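
The two selection steps above amount to a greedy forward search, which can be sketched as follows. The Python code is an illustration under assumptions, not Prabhu's program: the function names are hypothetical, the covariances are the usual sample estimates with divisor n - 1, and the linear system is solved by a small Gaussian elimination so that no outside library is needed. Since $d_m$ is fixed at each step, maximizing the increase $d_{m+1} - d_m$ is the same as maximizing $d_{m+1}$.

```python
def moments(rows, idx):
    """Sample mean vector and covariance matrix of the features listed in idx."""
    n = len(rows)
    mu = [sum(r[i] for r in rows) / n for i in idx]
    cov = [[sum((r[i] - mu[a]) * (r[j] - mu[b]) for r in rows) / (n - 1)
            for b, j in enumerate(idx)]
           for a, i in enumerate(idx)]
    return mu, cov


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("combined covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def criterion(class1, class2, idx):
    """Value of d in equation (II.1.2) for the feature subset idx."""
    mu1, cov1 = moments(class1, idx)
    mu2, cov2 = moments(class2, idx)
    m = len(idx)
    diff = [mu1[a] - mu2[a] for a in range(m)]
    pooled = [[cov1[a][b] + cov2[a][b] for b in range(m)] for a in range(m)]
    weights = solve(pooled, diff)        # (Sigma1 + Sigma2)^-1 (mu1 - mu2)
    return sum(diff[a] * weights[a] for a in range(m))


def select_features(class1, class2, m):
    """Greedy forward selection of m feature indices (steps 1 and 2)."""
    remaining = list(range(len(class1[0])))
    chosen = []
    while len(chosen) < m and remaining:
        best = max(remaining, key=lambda j: criterion(class1, class2, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Each class is passed as a list of equal-length feature vectors. For a single feature the criterion reduces to equation (II.1.1), so the first pass of the loop is step 1 of the procedure.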

The separating hyperplane that was used for transforming vector
samples into scalars in this report is

$$\alpha_o^T x + \beta_o = 0$$

where

$$\alpha_o = (\Sigma^1 + \Sigma^2)^{-1}\,(\mu^1 - \mu^2). \tag{II.1.3}$$

The weighting vector maximizes

$$\frac{[\alpha^T(\mu^1 - \mu^2)]^2}{\alpha^T(\Sigma^1 + \Sigma^2)\,\alpha}$$

which is interpreted as the ratio of the distance between the means
of the classes to the dispersion of the classes along the direction $\alpha$.
If the classes are Gaussian, $N(\mu^1, \Sigma^1)$ and $N(\mu^2, \Sigma^2)$ respectively,
then $\alpha_o^T x + \beta_o$ is the separating surface that minimizes the probability
of misclassification with the prior probabilities of each class being
equal.

Appendix II.2 - EEG Data

A detailed discussion of the EEG data is given by Prabhu [1],
and much of the description presented in this appendix is based on
Prabhu's discussion. An electroencephalogram (EEG) is a recording
of electrical activity of the brain. The electrical activity is
of the order of microvolts and is measured by electrodes placed
on the surface of the scalp. While the precise origin of the
electrical potentials is not yet fully understood, it is generally
agreed that the potentials result from the synchronous activity of
a large number of cells. In order to maintain some uniformity in
the EEG measurements, it is necessary to keep the patient in the
same psychological state during different recordings. When the
recording is made from an alert patient in a darkened, soundless
room cut off from external stimuli, the EEG is said to be "spontaneous."

Since an EEG recording is the result of the combined activity of
many cells, an EEG signal can be considered to be a sample from a
random process. An example of an EEG is shown in Figure II.9.* The
EEG has been observed to have several dominant frequencies with the
most dominant between 8.5 and 10.5 c.p.s. This is called the alpha
frequency. An EEG record can be split into equal parts where the
length of each part is equal to the period of the alpha frequency. The
dotted line in Figure II.10 shows the average signal that results
from averaging these parts.

* Figures II.9, II.10, and II.11 have been taken from Prabhu [1].

[Figure: average signal]

While the spontaneous EEG represents the electrical activity
of the brain when no visual or auditory stimulus is present, a
different EEG signal can be produced by a flash of light into
the patient's eyes through closed eyelids. If a light is flashed
periodically at a frequency very near the alpha frequency, then
the EEG has the effect of being driven into resonance. The
EEG signal between two consecutive flashes is called an "evoked"
response, and the solid line in Figure II.10 shows the average
signal of the evoked responses.

The classification algorithms tested in this thesis attempted 
to distinguish between spontaneous EEG and evoked EEG. A signal over 
one period of the alpha frequency was taken to be one sample. 

The EEG record used in this thesis was obtained from NASA through
the former Electronics Research Center, Cambridge, Massachusetts. A
recording of ten minutes duration was made on a single person in one
sitting from a pair of electrodes located in the left occipital-parietal
area. Both spontaneous and evoked responses were obtained in
the one recording. A stroboscopic light was flashed into the eye of
the subject through closed eyelids. The frequency of the flashing
was tuned to his alpha rhythm, which was approximately 10 c.p.s.,
and thus a flash occurred every 100 milliseconds. The stroboscopic
light was blocked periodically from the eye of the subject, thereby
giving rise to spontaneous EEG. Thus the entire EEG record was composed
of blocks of evoked EEG driven at the alpha frequency and of
spontaneous EEG. The length of each block was about 25 seconds.

To facilitate digital computer work, each of the waveforms was 
discretized by sampling the amplitude every millisecond. Thus each 
response between two successive stroboscopic stimuli would be 
expected to have 100 sampled values. In practice, it was found that 
the number sometimes exceeded 100 due to drifts in the stroboscopic 
frequency. In order for the pattern vectors to be of uniform dimension, 
only the first 100 values were retained. 

In the experimental work of this thesis, only a few of the 100 
features in each digitized waveform were used. The features were 
selected by the feature reduction procedure explained in Appendix II. 1. 

In order to illustrate the degree of overlap between the two classes
of EEG signals, Figure II.11 shows a plot of samples from the two
types of EEG. The samples are two dimensional with the features
being the first two selected by the feature reduction procedure. The
line in the figure is the separating plane for the two features, where
the equation of the plane is also explained in Appendix II.1. Prabhu [1]
found that there was about a 20% error rate in classification decisions
made on single observations with the separating plane.

Appendix II.3 - Order Statistics

This appendix will define the notion of an order statistic
and present some of the properties of such a statistic. Some
references that can be consulted on order statistics are Hogg and
Craig [9], Wilks [10], Fraser [11], and David [12].

Let $X_1, X_2, \ldots, X_n$ be n independent random variables identically
distributed with absolutely continuous distribution function F(x)
and with probability density function f(x). Rearrange $X_1, X_2, \ldots, X_n$
in ascending order so that $X_{i_1} \le X_{i_2} \le \cdots \le X_{i_n}$. For convenience
relabel the set as $Y_1 = X_{i_1}, Y_2 = X_{i_2}, \ldots, Y_n = X_{i_n}$ so that
$Y_1 \le Y_2 \le \cdots \le Y_n$. $Y_i$, i = 1,2,...,n, is called the i-th order statistic
of the random sample $X_1, X_2, \ldots, X_n$.

The joint density function of $Y_1, Y_2, \ldots, Y_n$ can be shown to be

$$g(y_1, y_2, \ldots, y_n) = \begin{cases} n!\, f(y_1) f(y_2) \cdots f(y_n) & y_1 < y_2 < \cdots < y_n \\ 0 & \text{elsewhere.} \end{cases} \tag{II.3.1}$$

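From this joint density, it follows that the marginal probability
density function of $Y_i$ is

$$g_i(y_i) = \frac{n!}{(i-1)!\,(n-i)!}\,[F(y_i)]^{i-1}\,[1-F(y_i)]^{n-i}\,f(y_i) \tag{II.3.2}$$

and the joint density of $Y_i$ and $Y_j$, i < j, is

$$g_{ij}(y_i, y_j) = \begin{cases} \dfrac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}\,[F(y_i)]^{i-1}\,[F(y_j)-F(y_i)]^{j-i-1}\,[1-F(y_j)]^{n-j}\,f(y_i)\,f(y_j) & y_i < y_j \\ 0 & \text{elsewhere.} \end{cases} \tag{II.3.3}$$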


The distribution of F(X) will now be considered.
Let X be a random variable having an absolutely continuous distribution
function F(x) and probability density function f(x). Then the random
variable Z = F(X) has a uniform distribution on the interval (0,1).
This will be shown under the assumption that f(x) is positive and
continuous for a < x < b and zero elsewhere. The distribution
function of X can be written as

$$F(x) = \begin{cases} 0 & x \le a \\ \int_a^x f(w)\,dw & a < x < b \\ 1 & x \ge b. \end{cases}$$

Then for the transformation z = F(x), dz/dx = f(x) for a < x < b,
and

$$f(x)\left|\frac{dx}{dz}\right| = \frac{f(x)}{dz/dx} = \frac{f(x)}{f(x)} = 1 \quad \text{for } a < x < b.$$

Thus the probability density function of Z = F(X) is

$$h(z) = \begin{cases} 1 & 0 < z < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{II.3.4}$$

Since Z = F(X) is a random variable with a uniform distribution on
the interval (0,1), it follows that $p(F(X) \le v) = v$ for $0 \le v \le 1$. That is,
if p is the probability that a future sample will fall below the
random variable X, then the probability that p does not exceed v is v.
Consider again the random sample $X_1, X_2, \ldots, X_n$ and the set of order
statistics for this random sample $Y_1, Y_2, \ldots, Y_n$. Consider further the
set of random variables $F(X_1), F(X_2), \ldots, F(X_n)$. Since F(x) is nondecreasing
in x, it follows that $F(Y_1) \le F(Y_2) \le \cdots \le F(Y_n)$, and hence
$Z_1 = F(Y_1), Z_2 = F(Y_2), \ldots, Z_n = F(Y_n)$ are the order statistics of the random sample
$F(X_1), F(X_2), \ldots, F(X_n)$. Since F(X) is uniform on the interval (0,1), the
joint density function of $Z_1, Z_2, \ldots, Z_n$ is found from equation (II.3.1) to be

$$h(z_1, z_2, \ldots, z_n) = \begin{cases} n! & 0 < z_1 < z_2 < \cdots < z_n < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{II.3.5}$$

Similarly, the marginal density of $Z_k = F(Y_k)$ and the joint density
of $Z_i = F(Y_i)$ and $Z_j = F(Y_j)$, i < j, can be found from equations
(II.3.2) and (II.3.3):

$$h_k(z_k) = \begin{cases} \dfrac{n!}{(k-1)!\,(n-k)!}\,z_k^{k-1}\,(1-z_k)^{n-k} & 0 < z_k < 1 \\ 0 & \text{elsewhere} \end{cases} \tag{II.3.6}$$

$$h_{ij}(z_i, z_j) = \begin{cases} \dfrac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}\,z_i^{i-1}\,(z_j-z_i)^{j-i-1}\,(1-z_j)^{n-j} & 0 < z_i < z_j < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{II.3.7}$$


For the order statistics $Y_1, Y_2, \ldots, Y_n$, the intervals $(-\infty, y_1]$,
$(y_1, y_2]$, ..., $(y_n, +\infty)$ are called sample blocks. The probabilities
of a future observation falling in each of these sample blocks are
$F(y_1)$, $F(y_2)-F(y_1)$, ..., $1-F(y_n)$, respectively. $F(y_j)-F(y_{j-1})$ is called
a coverage of the sample block $(y_{j-1}, y_j]$. The distribution of
the random variable $Z_j - Z_i = F(Y_j)-F(Y_i)$, i < j, will now be considered.
It can be shown that the random variable $Z_j - Z_i$ has the same distribution
as the random variable $Z_{j-i}$. Thus from equation (II.3.2), $Z_j - Z_i =
F(Y_j)-F(Y_i)$ has the probability density function

$$h(v) = \begin{cases} \dfrac{n!}{(j-i-1)!\,(n-j+i)!}\,v^{j-i-1}\,(1-v)^{n-j+i} & 0 < v < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{II.3.8}$$

It is noted that this is a Beta distribution B(j-i, n-j+i+1). The
mean and variance of $F(Y_j)-F(Y_i)$, i < j, can be calculated to be

$$E[F(Y_j)-F(Y_i)] = \frac{j-i}{n+1} \tag{II.3.9}$$

$$\mathrm{Var}[F(Y_j)-F(Y_i)] = \frac{(j-i)(n-j+i+1)}{(n+1)^2\,(n+2)}. \tag{II.3.10}$$

In particular, $E[F(Y_{j+1})-F(Y_j)] = 1/(n+1)$. Thus the order statistics
partition the sample axis into n+1 parts, and the expected probability
of a future observation falling in each part is 1/(n+1).
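The mean in equation (II.3.9) can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the samples are taken to be uniform on (0,1), so that F(y) = y and a coverage is simply a difference of order statistics, and the function names `coverage_between` and `mean_coverage` are hypothetical.

```python
import random


def coverage_between(sample_u, i, j):
    """Coverage F(Y_j) - F(Y_i) for uniform samples, where F(y) = y.

    i and j are 1-based order-statistic indices with i < j.
    """
    ordered = sorted(sample_u)
    return ordered[j - 1] - ordered[i - 1]


def mean_coverage(n, i, j, trials=20000, seed=1):
    """Average the coverage over many independent samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        total += coverage_between(sample, i, j)
    return total / trials


n = 9
# One sample block: the mean should be near 1/(n+1).
print(mean_coverage(n, 3, 4), 1 / (n + 1))
# Five adjacent blocks: the mean should be near (j-i)/(n+1) = 5/(n+1).
print(mean_coverage(n, 2, 7), 5 / (n + 1))
```

Because the coverage distribution does not depend on F, the uniform case is sufficient to illustrate the result.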

The theory of sample blocks and coverages can be extended to
more than one dimension by using ordering functions. The concept of
ordering functions will be introduced by considering a single ordering
function to partition the s-dimensional sample space. Let
$(X_j^1, X_j^2, \ldots, X_j^s)$, j = 1,2,...,n, be n independent s-dimensional random
variables distributed as the random variable $X = (X^1, X^2, \ldots, X^s)$ with
a continuous s-variate distribution function $F(x^1, x^2, \ldots, x^s)$. If
$W = t(X^1, X^2, \ldots, X^s)$ is a random variable with a continuous distribution
T(w), then $t(x^1, x^2, \ldots, x^s)$ is an ordering function. $W_j = t(X_j^1, X_j^2, \ldots, X_j^s)$,
j = 1,2,...,n, constitutes a random sample from a population whose
distribution function is T(w), and the random sample can be ordered.
Let the order statistics for the random sample $(W_1, W_2, \ldots, W_n)$ be
$(W_{i_1}, W_{i_2}, \ldots, W_{i_n})$. Then the j-th sample block is
$B_j = \{x \mid t(x_{i_{j-1}}) < t(x) \le t(x_{i_j})\}$, where $x_{i_j}$ is the
s-dimensional sample such that $w_{i_j} = t(x_{i_j})$.
Figure II.12 provides an illustration in two dimensions. The coverages
of the n+1 sample blocks are $T(W_{i_1})$, $T(W_{i_2}) - T(W_{i_1})$, ...,
$T(W_{i_n}) - T(W_{i_{n-1}})$, $1 - T(W_{i_n})$, where the j-th coverage
$T(W_{i_j}) - T(W_{i_{j-1}})$ is the
probability that a future observation will fall in the j-th sample block.
It can be shown that the sum of any r of the n+1 coverages
has a Beta distribution B(r, n+1-r). Thus the expected
value of the probability of a future observation falling in any r, $r \le n$,
of the sample blocks is r/(n+1).

FIGURE II.12

Example of Ordering Function

For a random sample of n random variables, it is also possible
to partition the sample space into n+1 sample blocks by using as
many as n different ordering functions. It can be shown [7], [10]
that the coverages of each sample block, which are the probabilities
of a future observation falling in each sample block, still follow the
Beta distribution. Thus the expected value of the probability of a
future observation falling in any r, $r \le n$, of the sample blocks is
r/(n+1).



CHAPTER III 


A SURVEY OF DENSITY FUNCTION ESTIMATES 

Section I.6 of the introductory chapter mentioned that a
classification method will be presented that uses density estimates
in a sequential test called the sequential probability ratio test
(SPRT). The chapters that follow this one examine density function
estimates that are well suited for the SPRT and formulate an
estimated version of the SPRT from the density estimates. Before
proceeding to such a development, this chapter presents a survey of
several known techniques for estimating density functions.

III.1 Assumptions

In discussing the density estimates presented in this report,
the following assumptions about the samples from each class are made:

i) that the samples are scalars

ii) that the samples are independently, identically distributed
in each class

iii) that the samples of each class are of the continuous type.

(The footnote in Section II.2 defines a random variable of the
continuous type.)

III.2 Motivation for Density Function Estimates

In order to get a clearer idea of what is involved in estimating a
density function, the definition of a density function will be reviewed.
The probability distribution function F(x) of a random variable X is
defined as $F(x) = p(X \le x)$, and the density function f(x) is the derivative
of F(x), f(x) = dF(x)/dx. In the pattern classification procedures
discussed in this report, F(x) is unknown. The distribution function F(x)
can be easily estimated from training samples by taking as the estimate
the fraction of samples less than x (remember that only scalar samples
are being treated in this chapter). As the number of training samples
approaches infinity, this estimate of F(x) approaches the true F(x) with
probability one and in the mean square. Cramer [6] and Rao [13] are among
many authors who discuss this estimate.
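The estimate of F(x) described above can be sketched in a few lines. This is a minimal illustration, assuming scalar training samples held in a list; the name `empirical_cdf` is hypothetical and does not appear in the original work.

```python
from bisect import bisect_left


def empirical_cdf(training_samples):
    """Return F_hat, where F_hat(x) is the fraction of samples less than x."""
    ordered = sorted(training_samples)
    n = len(ordered)

    def F_hat(x):
        # bisect_left returns the number of sorted samples strictly below x.
        return bisect_left(ordered, x) / n

    return F_hat


F_hat = empirical_cdf([2.0, 0.5, 1.5, 3.0])
print(F_hat(1.0), F_hat(1.5), F_hat(10.0))  # 0.25 0.25 1.0
```

Sorting once and using a binary search keeps each evaluation cheap, which matters when the estimate is evaluated repeatedly.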

While the estimate of F(x) is straightforward, it is the estimate
of f(x) = F'(x) that is actually needed. The definition of a derivative,

$$\lim_{h \to 0} \frac{F(x+h)-F(x-h)}{2h} = f(x), \tag{III.1}$$

can be used to motivate methods for estimating f(x). Equation (III.1)
can be written more generally in terms of probabilities as

$$\lim_{\Delta \to 0} \frac{p(\text{observation} \in \Delta)}{\Delta} = f(x), \tag{III.2}$$

where $\Delta$ is the width of some interval that contains x. Thus f(x) could
be estimated by first approximating f(x) as in the left hand side of
equation (III.1) or (III.2) and then estimating the approximation from
training samples. Most methods which have been developed for estimating
f(x) involve using equations (III.1) and (III.2) in one of two ways:

i) one approach is to specify the interval width $\Delta$
and let the numerator $p(\text{observation} \in \Delta)$ be a random variable
to be estimated from the training samples

ii) another approach is to specify the numerator $p(\text{observation} \in \Delta)$
and to specify a certain number of training samples to be contained
in the interval $\Delta$ so that the denominator takes the value
of that interval width $\Delta$ which contains the specified number of
training samples.

In i) the interval width is specified and in ii) the training samples
determine the interval width. Rosenblatt [14], Whittle [15], and Parzen
[16] have written about i) and Loftsgaarden and Quesenberry [17] about ii).
Cover [18], in a general discussion of nonparametric pattern recognition
methods, briefly discusses the use of the Parzen density estimate in a
Bayes decision rule and mentions the estimate of Loftsgaarden and Quesenberry.
The remainder of this chapter will discuss several density estimates, stressing
properties which are important to sequential decision methods where, of course,
a string of observations is considered at once. Some considerations to
be described are storage requirements, complexity of calculations, and
continuity of the density estimates.

III.3 Density Models That Specify Bin Width
III.3.1 Fixed Bin Model

Perhaps the simplest density function estimate is the estimate
that is often referred to as a histogram and that will be called the
fixed bin model in this report. Referring to equation (III.2), this
density model sets the denominator and estimates the numerator. The
sample axis is partitioned into a number of fixed intervals as in
Figure III.1. The density estimate for an x in any interval is the
fraction of training samples in that interval divided by the interval
width. Let

n be the number of training samples
k be the number of bins
$Y_i$, i = 1,2,...,k+1, be the bin boundaries
$m_i$ be the number of samples in the i-th bin
(or in interval $(Y_i, Y_{i+1}]$);

then

$$\hat{f}(x) = \begin{cases} \dfrac{m_i/n}{Y_{i+1}-Y_i} & \text{for } Y_i < x \le Y_{i+1} \\ 0 & \text{for } x \le Y_1 \text{ or } x > Y_{k+1}. \end{cases} \tag{III.3}$$

By its construction, estimate (III.3) is a step function. Since the intervals
are specified by the choice of the $Y_i$'s, only the $Y_i$'s and the fraction
of samples in each bin need be stored while using the estimate. Thus,
the estimate is calculated for all x at once, and the whole density estimate
is stored for future use. One question that must be answered in formulating
this estimate is that of where to place the bins along the sample axis. If
the bins are wide or are placed where there are few samples, the estimate
$\hat{f}(x)$ may be inaccurate, and poor use will have been made of the training samples.
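Estimate (III.3) can be sketched as follows. This is an illustrative sketch only: the function name `fixed_bin_density` and the example data are assumptions, and the bin boundaries are taken to be supplied as an increasing list.

```python
from bisect import bisect_left


def fixed_bin_density(training_samples, boundaries):
    """Fixed bin (histogram) density estimate in the form of (III.3).

    boundaries is the increasing list Y_1 < Y_2 < ... < Y_{k+1}.
    Only the boundaries and the k bin heights are kept after construction.
    """
    n = len(training_samples)
    k = len(boundaries) - 1
    counts = [0] * k
    for s in training_samples:
        if boundaries[0] < s <= boundaries[-1]:
            # bisect_left gives idx with boundaries[idx-1] < s <= boundaries[idx]
            counts[bisect_left(boundaries, s) - 1] += 1
    heights = [
        (counts[i] / n) / (boundaries[i + 1] - boundaries[i]) for i in range(k)
    ]

    def f_hat(x):
        if x <= boundaries[0] or x > boundaries[-1]:
            return 0.0
        return heights[bisect_left(boundaries, x) - 1]

    return f_hat


samples = [0.1, 0.4, 0.6, 0.7, 1.5, 1.9, 2.5, 3.0]
boundaries = [0.0, 1.0, 2.0, 4.0]
f_hat = fixed_bin_density(samples, boundaries)
print(f_hat(0.5), f_hat(1.5), f_hat(3.5), f_hat(5.0))  # 0.5 0.25 0.125 0.0
```

Note that the training samples themselves are not needed once the heights are computed, which is the storage advantage described in the text.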

Hughes [19] discusses the effect of the number of training samples and
the number of bins on the mean accuracy of a Bayes decision rule which uses
fixed bin density estimates. In Hughes' paper, the placement of the
training samples in the bins was given a uniform prior distribution
in order to consider all possible combinations in which the training
samples might occur in the bins. Abend and Harley [20], Chandrasekaran
and Harley [21], and Hughes [22] amend the results of this paper by
using the training samples to provide posterior estimates of the
probabilities of an observation falling in each bin so that the estimates
will be consistent with the uniform prior distribution. Patrick and
Hancock [23] examine the Bayes decision rule for problems where the
training samples are available but their classification is unknown. In
discussing the situation when no information is known about the density
functions, they show that a fixed bin model can still be used to estimate
the density functions.

FIGURE III.1

Example of Fixed Bin Density Estimate

III.3.2 Parzen Model (Specified Sliding Bin)

Parzen [16] estimates the density function at x by centering a
bin of specified width about x. Similar to the fixed bin model, Parzen's
density estimate specifies the denominator of equation (III.1) and
estimates the numerator. As the bin (x-h, x+h) is always centered at the
x for which the density estimate is desired, the mechanism of the model may
be viewed as a sliding window of width 2h. Figure III.2 illustrates the
model. The estimate at any x is

$$\hat{f}(x) = \frac{\text{fraction of training samples in } (x-h,\,x+h)}{2h}. \tag{III.4}$$

The model is similar to the fixed bin model in that the bin width is
specified, and again there is the question of how wide to set the bin.
It may be that for some x the interval (x-h, x+h) does not contain a
great enough percentage of the training samples to provide an accurate
estimate of f(x). Given an x, it may be necessary to change h until a
satisfactory number of samples is contained in (x-h, x+h). Parzen and
Rosenblatt have developed formulas for h as a function of the number of
samples so that h minimizes the mean square error of the estimate, but
these expressions require a knowledge of f(x) and usually f"(x). The
utilization of this model in a decision algorithm requires that all
training samples be stored. The estimate is then calculated for
each x. The estimate in equation (III.4) is not continuous, but the
general formula for the Parzen estimator presented in the next paragraph
can provide a continuous estimate.

[Figure III.2: estimate (III.4) over the sliding bin (x-h, x+h)]

Let there be n training samples $\{x_i\}$, i = 1,2,...,n. Then Parzen's
model can be expressed in a general formula

$$\hat{f}(x) = \frac{1}{n\,h(n)} \sum_{i=1}^{n} K\!\left(\frac{x-x_i}{h(n)}\right) \tag{III.5}$$

where

$$\sup_{-\infty<y<\infty} |K(y)| < \infty, \qquad \int_{-\infty}^{\infty} |K(y)|\,dy < \infty, \qquad \lim_{y \to \infty} |yK(y)| = 0, \qquad \int_{-\infty}^{\infty} K(y)\,dy = 1$$

are conditions necessary for equation (III.5) to be asymptotically
an unbiased estimator of f(x). The estimate (III.5) converges
to f(x) in the mean square if $h(n) \to 0$ and $nh(n) \to \infty$ as $n \to \infty$.
The convergence condition $h(n) \to 0$ may be interpreted in equation
(III.4) as letting the interval width shrink to zero, while the condition
$nh(n) \to \infty$ requires the number of samples in the interval to approach
infinity.

If

$$K(y) = \begin{cases} \tfrac{1}{2} & \text{for } |y| \le 1 \\ 0 & \text{for } |y| > 1 \end{cases} \tag{III.6}$$

then equation (III.5) agrees with equation (III.4). The Parzen estimate
is continuous in x for other choices of K(y). An example of K(y) which
results in a continuous estimate is to take

$$K(y) = \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}. \tag{III.7}$$

For this choice of K(y), estimate (III.5) is the sum of n Gaussian
densities, each centered about a training sample and weighted by 1/n.
Van Ryzin [24] has developed a classification procedure that
makes use of the Parzen estimator in Bayes rule.
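The general Parzen formula (III.5) can be sketched with the two kernels mentioned above. This is a minimal sketch under stated assumptions: h is passed as a fixed number rather than as a function h(n), and the names `rectangular_kernel`, `gaussian_kernel`, and `parzen_density` are hypothetical.

```python
import math


def rectangular_kernel(y):
    """Kernel of the form (III.6): 1/2 inside the unit interval, 0 outside."""
    return 0.5 if abs(y) <= 1.0 else 0.0


def gaussian_kernel(y):
    """Kernel of the form (III.7): the standard Gaussian density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def parzen_density(training_samples, h, kernel):
    """Parzen estimate (III.5) with a fixed window parameter h.

    All training samples are kept, and the sum is recomputed for each x.
    """
    n = len(training_samples)

    def f_hat(x):
        return sum(kernel((x - s) / h) for s in training_samples) / (n * h)

    return f_hat


samples = [0.0, 0.2, 0.9, 1.1, 3.0]
f_rect = parzen_density(samples, 0.5, rectangular_kernel)
f_gauss = parzen_density(samples, 0.5, gaussian_kernel)

# With the rectangular kernel, two of the five samples lie in (0.5, 1.5),
# so the estimate at x = 1.0 is (2/5) / (2 * 0.5).
print(f_rect(1.0), f_gauss(1.0))
```

The rectangular kernel reproduces the sliding bin of equation (III.4), while the Gaussian kernel gives an estimate that varies smoothly with x.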

III.4 Density Models where the Bin Width is Determined by Training Samples
III.4.1 Nearest Neighbor Density Estimate (Variable Sliding Bin)

Loftsgaarden and Quesenberry [17] have developed an estimate that
employs an interval which is centered at x and whose width is determined
by the training samples. Unlike the fixed bin and Parzen models, the
estimate of Loftsgaarden and Quesenberry specifies the numerator of
equation (III.2) and estimates the denominator. In Section III.3.2,
it was mentioned that the Parzen model could be viewed as a sliding
bin of specified width centered at x. Similarly the Loftsgaarden
and Quesenberry estimate can be viewed as a sliding bin of variable
width. An integer $\ell(n)$ is chosen (n is always taken to be the number of
training samples), and the $\ell(n)$-th nearest training sample to x, called
$x_{\ell(n)}$, is found. The interval width is then taken to be $2|x - x_{\ell(n)}|$, and it follows
that the fraction of samples inside the interval is $(\ell(n)-1)/n$. The estimate
is

$$\hat{f}(x) = \frac{(\ell(n)-1)/n}{2\,|x - x_{\ell(n)}|} \tag{III.8}$$

where $x_{\ell(n)}$ is the $\ell(n)$-th nearest sample to x according to the
distance measure |x-y|. Figure III.3 provides an example. The
estimate (III.8) converges to f(x) in probability if $\ell(n) \to \infty$
and $\ell(n)/n \to 0$ as $n \to \infty$. The condition $\ell(n)/n \to 0$ lets the width $2|x - x_{\ell(n)}|$
shrink to zero while the condition $\ell(n) \to \infty$ allows the number of
training samples contained in the interval to approach infinity.

Metrics other than |x-y| may be used in the estimate. In general,
if the metric d(x,y) is employed, the estimate is

$$\hat{f}(x) = \frac{(\ell(n)-1)/n}{2\,d(x, x_{\ell(n)})} \tag{III.9}$$

where $x_{\ell(n)}$ is the $\ell(n)$-th closest training sample to x according
to the metric d(x,y).
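The scalar form (III.8) of the nearest neighbor estimate can be sketched as below. This is an illustration under stated assumptions: the integer $\ell(n)$ is passed directly as the argument `ell`, the distance measure is |x-y|, and the function name `nn_density` is hypothetical.

```python
def nn_density(training_samples, ell):
    """Nearest neighbor density estimate in the form of (III.8).

    ell is the chosen integer l(n), with 1 <= ell <= n. All training
    samples are kept, and the ell-th nearest one is found for each x.
    """
    n = len(training_samples)
    if not 1 <= ell <= n:
        raise ValueError("ell must lie between 1 and the number of samples")

    def f_hat(x):
        distances = sorted(abs(x - s) for s in training_samples)
        radius = distances[ell - 1]  # |x - x_l(n)|
        if radius == 0.0:
            # x coincides with its ell-th nearest sample; (III.8) is undefined.
            raise ZeroDivisionError("interval width is zero at this x")
        return ((ell - 1) / n) / (2.0 * radius)

    return f_hat


samples = [0.0, 0.2, 0.9, 1.1, 3.0]
f_hat = nn_density(samples, 3)
# At x = 1.0 the third nearest sample is 0.2, so the half-width is 0.8
# and the estimate is (2/5) / 1.6.
print(f_hat(1.0))
```

Sorting all distances for every x is the simplest approach; for large training sets a partial selection of the $\ell(n)$-th smallest distance would be cheaper.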




FIG. III.3

Example of Nearest Neighbor Density Estimate

The estimate of Loftsgaarden and Quesenberry is related to
the nearest neighbor methods of pattern recognition [25, 26]. In
the nearest neighbor (NN) methods, an observation is classified
into that class which is most heavily represented among some
specified number of nearest neighbors of the observation. Since the
estimate of Loftsgaarden and Quesenberry involves finding the
$\ell(n)$-th nearest neighbor to x, it will be called the nearest neighbor
(NN) density estimate in this thesis.

The NN density estimate is continuous in x. All training
samples must be stored in order to use the estimate, and then for
any particular sample value x, the estimate is calculated. In the NN
estimate, the bin centered at any x always contains a specified number
of training samples, whereas in the Parzen estimate, in which the bin
width is specified beforehand, the interval may contain so few
samples that the estimate can be quite inaccurate. This problem of
bin placement is discussed further in Sections III.5 and IV.3.2.

III.5 Accuracy and Storage of Density Estimates

The purpose of studying density function estimates in this report
is to examine their use in sequential classification algorithms. In
practical decision problems, the amount of storage available for storing
the density estimates during computation is limited. While limiting
the storage of the density estimate is necessary, the accuracy of the
estimate is thereby decreased.

In considering the accuracy of estimates of continuous density
functions, the accuracy may be divided into two parts, one of a
deterministic nature and the other of a random nature. Density
estimates make a deterministic approximation of f(x) in the
neighborhood of x and then estimate the value of the approximation from
the training samples. Thus, the training samples are not used to
estimate f(x) directly but rather to estimate some deterministic
approximation to f(x), which is a function of F(x), such as

$$\frac{F(x+h)-F(x-h)}{2h}. \tag{III.10}$$

The total accuracy of the estimated density depends on how accurate
an estimate of the approximation can be obtained from the training
samples (the random part) and on the accuracy of the approximation
(the deterministic part).

For example, in the Parzen estimate of equation (III.4), the density
function is approximated by [F(x+h) - F(x-h)]/2h. The interval width
2h is specified, and then F(x+h) - F(x-h) is estimated from the training
samples. No matter how accurately F(x+h) - F(x-h) is estimated, the
accuracy of the Parzen estimate will be low if [F(x+h) - F(x-h)]/2h is
a poor approximation of f(x). Likewise, if F(x+h) - F(x-h) is poorly
estimated, the density estimator will be inaccurate even though
[F(x+h) - F(x-h)]/2h may accurately approximate f(x). Both the
deterministic and random parts of a density estimate must be good for the
total estimate to be accurate. The conditions for convergence of equation
(III.5) express this phenomenon. The condition $h(n) \to 0$ requires
the interval width to shrink to zero and thus the deterministic
part to converge; $n \to \infty$ causes the estimate of F(x+h) -
F(x-h) and hence the random part to converge. Both the
random and deterministic parts must converge simultaneously. The
condition $nh(n) \to \infty$ means that as the interval width shrinks to
zero the number of samples inside the interval approaches infinity.
In general, the deterministic part of the accuracy depends on the bin
size and the random part on the number of training samples, including
the number of samples inside the interval. The choice of the bin
size is a trade off between making it small to provide deterministic
accuracy or large to give random accuracy by containing a large
fraction of training samples. Rosenblatt [14] shows that density
estimates must be biased for a finite number of samples. The bias
arises from the deterministic approximation of f(x). The estimate of
the approximation can be unbiased, but the error in the approximation
still remains.

Since the intervals of the Parzen and NN estimates are centered
at x, they are more accurate in the deterministic sense than the
fixed bin model. But the Parzen and NN methods require storage of
all training samples for good random accuracy. The fixed bin model
sacrifices some deterministic accuracy but retains good random accuracy
in limited storage.

This chapter has discussed some properties of different density
estimates, but a more detailed discussion will be presented in the
next chapter in connection with a new proposed estimate. The various
properties of the density estimates discussed so far seem to be determined
by two factors: 1) whether the bin width is specified or is set by the
training samples, and 2) whether the density function is estimated for
all x at once and the total estimate stored, or all the training samples
are stored and the density is estimated separately for each x.
Table III.1 lists the density estimates in a matrix form and shows
how the various estimates are related to these two factors. Also
listed are properties of the density estimates as determined by the
two factors.

There is one blank position in the two by two matrix in Table III.1,
and the next chapter will propose a density estimate that fits into
the blank slot. The density estimate will combine some properties of
the NN and fixed bin estimates, as the blank position in the matrix
indicates it should. The model will be a step function so the small
storage advantage of the fixed bin model will be retained. But the bin
widths and positions will be determined by the training samples so
that the bin placement will result in an accuracy greater than that of the
fixed bin model.


TABLE III.1. Bin, Parzen, and NN Density Estimates

Factor 1                             Bin Width Specified          Bin Width Set by
                                                                  Training Samples

Properties Influenced by Factor 1

In f(x) ≈ p(x ∈ Δ)/Δ                 denominator specified,       numerator specified,
                                     numerator estimated          denominator estimated

Difficulty of bin size choice (4)    more                         less

Convergence conditions as number     specified bin width → 0      number of samples specified
of training samples → ∞              at such a rate that number   in bin → ∞ at such a rate
                                     of samples in bin → ∞        that bin width → 0 (5)

1. In Total Point Estimate, the density function is
estimated for all x at once, and the total estimate
is stored.

2. In Single Point Estimate, all training samples are
stored and the density is estimated separately for each x.

3. These numbers indicate references in the bibliography.

4. When the bin width is specified, there is a problem of
how to choose it initially so as to contain a number of
training samples that would give a reasonable estimate.
In letting the training samples set the bin width, a
reasonable estimate is more readily obtained.

5. The number of samples specified in the bin → ∞ but at a rate
sufficiently slower than the total number of training
samples → ∞ in order that the bin width that contains the
specified number of samples → 0.

CHAPTER IV


RANDOM BIN MODEL 

Chapter III discussed three density estimates: the fixed bin
model, the Parzen model, and the nearest neighbor model. Both
the fixed bin and Parzen models have a computational disadvantage
in that the bin width is specified before the density is estimated
from the training samples. It is not known where to position the
intervals in relation to the distribution of the training samples,
and it is possible that the bin width could be set so wide as to
contain half or even all of the training samples. If an interval
contains too large a percentage of samples, the bin width can be
changed and the density estimate repeated. But iterating on the
interval width complicates the estimation of the density. The NN
estimate overcomes the problem of setting the bin size by determining
the interval width from the training samples. The number of training
samples $\ell$ to be contained in a bin is specified, and the bin size
is determined by the width necessary to contain this number of samples.
Different values of $\ell$ result in different estimate accuracies, but
whatever percentage of samples for a bin is specified, the bin width
will be reasonable since it is determined by the distribution of the
training samples. The density estimate presented in this chapter
combines the property of the NN estimate of placing the bins by the
training samples with the low storage advantage of the fixed bin model.
Since the new density estimate has a step function form similar to the
fixed bin model and at the same time determines the bin widths from
the training samples, the estimate is called the random bin density
estimate.


IV.1 Presentation of Random Bin Estimate

The random bin model attempts to place bins so that the probability
of an observation falling in each bin has a specified value.
Usually the bins are positioned so that it is equally likely an observation
will fall in any bin, as illustrated in Figure IV.1. Let k+1 be the
number of bins. The bin widths are determined so the probability of
an observation falling in any bin is 1/(k+1). Then

$$\hat{f}(x) = \frac{1/(k+1)}{\text{estimated width of the } i\text{-th bin such that } p(x \in i\text{-th bin}) = 1/(k+1)} \quad \text{for } x \in i\text{-th bin.} \tag{IV.1}$$

The bin boundaries are calculated from quantiles and quantile
estimates. The next few sections discuss quantiles, their estimates,
and a density estimate based on quantiles. The assumptions on the
data listed in Section III.1 still hold in the following discussion.
The assumptions were that the samples are scalars, identically and
independently distributed in each class with absolutely continuous
distribution functions. Conditions for the density estimates
discussed in this thesis to converge to the true density f(x) require
that f(x) be continuous at x. By the assumption of absolute continuity
of F(x), the number of discontinuous points of f(x) is finite in any
finite interval. Since in this report the purpose of obtaining
density estimates is to classify observations, a density is estimated
only at values of a given observation. The probability of an
observation occurring at a discontinuity is zero. Thus, the assumption
of absolute continuity of F(x) is not restrictive for classification
purposes. The convergence conditions of the random bin density
estimate that will be presented in Theorem IV.2 also assume f'(x)
is continuous in a neighborhood of x and $f(x) \ne 0$ at x. Again,
as long as the number of points at which f(x) is not continuously
differentiable or f(x) is equal to zero is finite in any finite interval,
the conditions are not restrictive.


IV.1.1 Definition of Quantile

The p-th order quantile, labeled $\xi_p$, of a distribution function
F(x) is any value of x such that $F(x = \xi_p) = p$. See Figure IV.2.
In this report, $\xi_p$ is assumed to be unique for any p. Since x is
a random variable of the continuous type and hence F(x) is absolutely
continuous, the existence of $\xi_p$ for any p is guaranteed. The further
assumption of the uniqueness of $\xi_p$ means that F(x) is strictly
increasing in x.


IV.1.2 Set of Quantiles

For any integer k, a set of k quantiles
$\xi_{\frac{1}{k+1}}, \xi_{\frac{2}{k+1}}, \ldots, \xi_{\frac{k}{k+1}}$
can be defined such that for any two consecutive quantiles
$\xi_{\frac{j}{k+1}}$ and $\xi_{\frac{j+1}{k+1}}$,

$$F(\xi_{\frac{j+1}{k+1}}) - F(\xi_{\frac{j}{k+1}}) = \frac{1}{k+1} \tag{IV.2}$$

Figure IV.3 provides an illustration. Thus the set of k quantiles
partitions the sample axis so that the probability of an observation
falling in any partition is $\frac{1}{k+1}$.

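IV.1.3 Defining a Density from Quantiles

Let X be a random variable with distribution function F(x) and
density f(x), and let
$\xi_{\frac{1}{k+1}}, \xi_{\frac{2}{k+1}}, \ldots, \xi_{\frac{k}{k+1}}$
be a set of k quantiles. An approximation of f(x) is

$$f_{\text{approx}}(x) = \begin{cases} 0, & x < \xi_{\frac{1}{k+1}} \\[4pt] \dfrac{F(\xi_{\frac{j+1}{k+1}}) - F(\xi_{\frac{j}{k+1}})}{\xi_{\frac{j+1}{k+1}} - \xi_{\frac{j}{k+1}}}, & \xi_{\frac{j}{k+1}} < x \le \xi_{\frac{j+1}{k+1}} \\[4pt] 0, & x > \xi_{\frac{k}{k+1}} \end{cases} \tag{IV.3}$$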


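If X is known to be distributed over an interval (a,b), then
$f_{\text{approx}}(x)$ can be written as

$$f_{\text{approx}}(x) = \begin{cases} 0, & x < a \\[4pt] \dfrac{F(\xi_{\frac{1}{k+1}})}{\xi_{\frac{1}{k+1}} - a}, & a \le x \le \xi_{\frac{1}{k+1}} \\[4pt] \dfrac{F(\xi_{\frac{j+1}{k+1}}) - F(\xi_{\frac{j}{k+1}})}{\xi_{\frac{j+1}{k+1}} - \xi_{\frac{j}{k+1}}}, & \xi_{\frac{j}{k+1}} < x \le \xi_{\frac{j+1}{k+1}} \\[4pt] \dfrac{1 - F(\xi_{\frac{k}{k+1}})}{b - \xi_{\frac{k}{k+1}}}, & \xi_{\frac{k}{k+1}} < x \le b \\[4pt] 0, & x > b \end{cases}$$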


By equation (IV.2), the numerators of equation (IV.3) are all equal to
$\frac{1}{k+1}$. If k is allowed to approach infinity and for any x one
chooses from the set of k quantiles
$\xi_{\frac{1}{k+1}}, \xi_{\frac{2}{k+1}}, \ldots, \xi_{\frac{k}{k+1}}$
the pair of quantiles just below and just above x, the approximation
converges to f(x). This is shown below:

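Theorem 1: Let X be a random variable with an absolutely continuous
distribution function F(x) and with probability density function f(x). Let
$\xi_{\frac{1}{k+1}}, \xi_{\frac{2}{k+1}}, \ldots, \xi_{\frac{k}{k+1}}$
be a set of k quantiles from F(x). Define

$$f_{\text{approx}}(x) = \begin{cases} 0, & x < \xi_{\frac{1}{k+1}} \\[4pt] \dfrac{1/(k+1)}{\xi_{\frac{j+1}{k+1}} - \xi_{\frac{j}{k+1}}}, & \xi_{\frac{j}{k+1}} < x \le \xi_{\frac{j+1}{k+1}} \\[4pt] 0, & x > \xi_{\frac{k}{k+1}} \end{cases} \tag{IV.4}$$

Then at all x for which f(x) is continuous

$$\lim_{k \to \infty} f_{\text{approx}}(x) = f(x) \tag{IV.5}$$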
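The convergence stated in Theorem 1 can be checked numerically for a
distribution whose quantiles are known in closed form. The sketch below
is only an illustration under assumptions that are not part of the
thesis: it uses the unit exponential distribution, F(x) = 1 - exp(-x),
and the function names are hypothetical.

```python
import math

def quantile_exp(p):
    """Exact p-th order quantile of the unit exponential distribution."""
    return -math.log(1.0 - p)

def f_approx(x, k):
    """Density approximation of equation (IV.4) built from k exact quantiles."""
    xi = [quantile_exp(j / (k + 1)) for j in range(1, k + 1)]
    if x < xi[0] or x > xi[-1]:
        return 0.0  # outside the lowest or highest quantile
    for lo, hi in zip(xi, xi[1:]):
        if lo < x <= hi:
            # Each bin carries probability 1/(k+1) by equation (IV.2).
            return (1.0 / (k + 1)) / (hi - lo)
    return 0.0

x = 1.0
true_density = math.exp(-x)
# Absolute error of the approximation at x for an increasing number of quantiles.
errors = [abs(f_approx(x, k) - true_density) for k in (10, 100, 1000)]
```

For this particular choice of x the error shrinks as k grows, which is
the behavior Theorem 1 predicts at points where f(x) is continuous.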


First, convergence of a more general form of equation (IV.4)
will be proved.

Lemma 1: Let F(x) be an absolutely continuous function. For a
constant x, let $a_k$ be a sequence of real numbers such that $a_k < x$
and $a_k \to x$ as $k \to \infty$, and let $b_k$ be a sequence of real
numbers such that $b_k \ge x$ and $b_k \to x$ as $k \to \infty$. Then at
all x for which F'(x) is continuous

$$\lim_{k \to \infty} \frac{F(b_k) - F(a_k)}{b_k - a_k} = F'(x). \tag{IV.6}$$

Proof: Since F(x) is absolutely continuous,

$$\Delta F_k \triangleq \frac{F(b_k) - F(a_k)}{b_k - a_k} = \frac{1}{b_k - a_k} \int_{a_k}^{b_k} F'(u)\,du \tag{IV.7}$$

Subtracting F'(x) from both sides of equation (IV.7),

$$\Delta F_k - F'(x) = \frac{1}{b_k - a_k} \int_{a_k}^{b_k} \bigl(F'(u) - F'(x)\bigr)\,du \tag{IV.8}$$

By the assumption of continuity of F'(x) near x, for all $\epsilon > 0$ there
exists a $\delta_\epsilon > 0$ such that $|F'(y) - F'(t)| < \epsilon$ if
$|y - t| < \delta_\epsilon$. Given an $\epsilon$, choose $k_\epsilon$ such
that $b_k - a_k < \delta_\epsilon$ if $k \ge k_\epsilon$. Then it is observed
that $|u - x| < \delta_\epsilon$ if $k \ge k_\epsilon$ and $a_k < u < b_k$
(remember $a_k < x \le b_k$). The condition $|u - x| < \delta_\epsilon$ and
continuity of F'(x) imply

$$|F'(u) - F'(x)| < \epsilon \tag{IV.9}$$

Substituting equation (IV.9) in equation (IV.8),

$$|\Delta F_k - F'(x)| < \frac{1}{b_k - a_k} \int_{a_k}^{b_k} \epsilon\,du = \epsilon \tag{IV.10}$$

and so $|\Delta F_k - F'(x)| < \epsilon$ if $k \ge k_\epsilon$. Thus for any
$\epsilon > 0$, there exists a $k_\epsilon$ such that
$|\Delta F_k - F'(x)| < \epsilon$ if $k \ge k_\epsilon$, and Lemma 1 is proven.

Proof of Theorem 1: Lemma 1 implies Theorem 1 if $\xi_{\frac{j}{k+1}}$ in
equation (IV.4) can be identified with $a_k$ and $\xi_{\frac{j+1}{k+1}}$ with
$b_k$. By construction of equation (IV.4),
$\xi_{\frac{j}{k+1}} < x \le \xi_{\frac{j+1}{k+1}}$. It remains to show that
$\xi_{\frac{j}{k+1}} \to x$ and $\xi_{\frac{j+1}{k+1}} \to x$ as $k \to \infty$.
It has been assumed that for any p the p-th order quantile $\xi_p$ is unique,
and hence F(x) is strictly increasing in x. So
$\xi_{\frac{j}{k+1}} < x \le \xi_{\frac{j+1}{k+1}}$ implies

$$F(\xi_{\frac{j}{k+1}}) < F(x) \le F(\xi_{\frac{j+1}{k+1}}) \tag{IV.11}$$

Now $F(\xi_{\frac{j+1}{k+1}}) - F(\xi_{\frac{j}{k+1}}) = \frac{1}{k+1}$. For any
$\epsilon > 0$, there exists a $k_\epsilon$ such that $\frac{1}{k+1} < \epsilon$
if $k \ge k_\epsilon$. Thus

$$\lim_{k \to \infty} \bigl[F(\xi_{\frac{j+1}{k+1}}) - F(\xi_{\frac{j}{k+1}})\bigr] = 0. \tag{IV.12}$$

Equations (IV.11) and (IV.12) imply

$$\lim_{k \to \infty} F(\xi_{\frac{j}{k+1}}) = F(x) \tag{IV.13}$$

and

$$\lim_{k \to \infty} F(\xi_{\frac{j+1}{k+1}}) = F(x) \tag{IV.14}$$

Since F(x) is strictly increasing in x, equations (IV.13) and (IV.14)
imply

$$\lim_{k \to \infty} \xi_{\frac{j}{k+1}} = x \tag{IV.15}$$

and

$$\lim_{k \to \infty} \xi_{\frac{j+1}{k+1}} = x \tag{IV.16}$$

The sequence $\xi_{\frac{j}{k+1}}$ has the properties of $a_k$ and
$\xi_{\frac{j+1}{k+1}}$ those of $b_k$, and so Lemma 1 implies Theorem 1.


IV.1.4 Quantile Estimates

Equation (IV.4) presents a density approximation containing
quantiles. If F(x) is unknown, the quantiles can be estimated from
training samples. A density estimate can be constructed by replacing
the quantiles in equation (IV.4) with quantile estimates.

The p-th order quantile of a distribution function F(x) can be
estimated from training samples with order statistic theory. Let
n independent observations of a random variable X be arranged in
ascending order,

$$x_{i_1} < x_{i_2} < \cdots < x_{i_n}. \tag{IV.17}$$

Relabel the samples for convenience:

$$y_1 = x_{i_1},\quad y_2 = x_{i_2},\quad \ldots,\quad y_n = x_{i_n}. \tag{IV.18}$$

$(y_1, y_2, \ldots, y_n)$ is, as is mentioned in Section II.3, a set of order
statistics. An estimate of the p-th order quantile $\xi_p$ is

$$\hat{\xi}_p = y_{[np]+1} \tag{IV.19}$$

where [w] is the largest integer less than or equal to w. If np is
an integer, choose any value in the closed interval between $y_{np}$ and
$y_{np+1}$. The distance between the two neighboring order statistics
$y_{np}$ and $y_{np+1}$ tends to zero as n approaches infinity. A motivation
for $\hat{\xi}_p$ is that the fraction of samples less than $\hat{\xi}_p$ is
near p and from order statistic theory (see Appendix II.3)
$E[F(\hat{\xi}_p)] = \frac{[np]+1}{n+1}$, which is approximately p. Rao [13]
shows that the estimate $\hat{\xi}_p$ approaches $\xi_p$ as $n \to \infty$ with
probability one. The distribution of $\hat{\xi}_p$ is shown by David [12] to be
asymptotically Gaussian with mean $\xi_p$ and variance
$\frac{p(1-p)}{n[f(\xi_p)]^2}$, where for np equal to an integer $\hat{\xi}_p$
is taken to be $y_{np}$ to simplify the indeterminate case.

The set of quantiles
$\xi_{\frac{1}{k+1}}, \xi_{\frac{2}{k+1}}, \ldots, \xi_{\frac{k}{k+1}}$
that appear in the density approximation of equation (IV.4) can be
estimated from equation (IV.19):

$$\hat{\xi}_{\frac{j}{k+1}} = y_{\left[\frac{jn}{k+1}\right]+1} \quad \text{for } j = 1, 2, \ldots, k \tag{IV.20}$$
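Equation (IV.20) amounts to sorting the training samples and reading off
k order statistics. A minimal sketch follows; the function name
`quantile_estimates` is hypothetical, and the samples are assumed to be
distinct scalars as in the continuous case treated here.

```python
def quantile_estimates(samples, k):
    """Estimate the quantiles of orders j/(k+1), j = 1..k, per equation (IV.20)."""
    y = sorted(samples)  # order statistics y_1 <= y_2 <= ... <= y_n
    n = len(y)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # [jn/(k+1)] + 1 is a 1-based index, so the 0-based index is (j*n) // (k+1).
    return [y[(j * n) // (k + 1)] for j in range(1, k + 1)]

# Ten samples split into five bins by four estimated boundaries.
boundaries = quantile_estimates([7, 3, 9, 1, 5, 10, 2, 8, 6, 4], 4)
```

Integer arithmetic is used for the index so that the floor in [jn/(k+1)]
is exact rather than subject to floating-point rounding.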


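IV.1.5 Estimating the Density Function f(x)

If a set of k quantile estimates
$\hat{\xi}_{\frac{1}{k+1}}, \hat{\xi}_{\frac{2}{k+1}}, \ldots, \hat{\xi}_{\frac{k}{k+1}}$
determined by equation (IV.20) serve as the bin boundaries in the
random bin estimate, then each bin contains approximately the same
number of training samples. The random bin density estimate is

$$\hat{f}(x) = \begin{cases} 0, & x < \hat{\xi}_{\frac{1}{k+1}} \\[4pt] \dfrac{1/(k+1)}{\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}}}, & \hat{\xi}_{\frac{j}{k+1}} < x \le \hat{\xi}_{\frac{j+1}{k+1}} \\[4pt] 0, & x > \hat{\xi}_{\frac{k}{k+1}} \end{cases} \tag{IV.21}$$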


The following theorem shows that the random bin density estimate
converges in probability to the true density if $k \to \infty$ and
$k/n \to 0$ as $n \to \infty$. For convergence of $\hat{f}(x)$, the bin
width must approach zero yet contain an infinite number of training
samples. The condition $k \to \infty$ lets the bin width tend to zero
while $k/n \to 0$ allows the number of samples in each bin to approach
infinity. The need for $k \to \infty$ and $k/n \to 0$ as $n \to \infty$
can also be seen by inspecting

$$\hat{f}(x) = \frac{1}{(k+1)\bigl(\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}}\bigr)}.$$

The conditions $k \to \infty$ and $n \to \infty$ are necessary for
$\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}} \to 0$. Since
$\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}}$ is multiplied by
$k+1$ and $k+1 \to \infty$, an additional condition of $k/n \to 0$ is
needed in order that both $(k+1)$ and
$(\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}})$ converge at rates
appropriate for
$1/\bigl[(k+1)(\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}})\bigr]$
to converge to f(x).

In the proof of convergence of $\hat{f}(x)$ to f(x) in the following
theorem, a lemma is first developed that shows that $\hat{f}(x)$ follows
asymptotically for large n a Gaussian distribution. The lemma shows
that since $\hat{\xi}_{\frac{j}{k+1}}$, $j = 1, 2, \ldots, k$, are
asymptotically jointly Gaussian, the asymptotic distribution of
$1/\hat{f}(x) = (k+1)(\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}})$,
which is a linear combination of two Gaussian random variables, is Gaussian.
The asymptotic distribution of $\hat{f}(x)$ is then proved to be Gaussian.
The proof of the theorem concludes by showing the convergence of
$\hat{f}(x)$ to f(x).

Theorem 2: Let $X_1, X_2, \ldots, X_n$ be n independent random variables
identically distributed as a random variable X with an absolutely
continuous distribution function F(x) and with probability density
function f(x). Let $(y_1, y_2, \ldots, y_n)$ be the set of n order statistics
for $(X_1, X_2, \ldots, X_n)$, and let
$\hat{\xi}_{\frac{j}{k(n)+1}} = y_{\left[\frac{jn}{k(n)+1}\right]+1}$,
$j = 1, 2, \ldots, k(n)$, where k(n) is a sequence of positive integers such
that $k(n) \to \infty$ and $k(n)/n \to 0$ as $n \to \infty$. Define

$$\hat{f}_n(x) = \begin{cases} 0, & x < \hat{\xi}_{\frac{1}{k(n)+1}} \\[4pt] \dfrac{1/(k(n)+1)}{\hat{\xi}_{\frac{j+1}{k(n)+1}} - \hat{\xi}_{\frac{j}{k(n)+1}}}, & \hat{\xi}_{\frac{j}{k(n)+1}} < x \le \hat{\xi}_{\frac{j+1}{k(n)+1}} \\[4pt] 0, & x > \hat{\xi}_{\frac{k(n)}{k(n)+1}} \end{cases} \tag{IV.22}$$

Then $\hat{f}_n(x)$ is a consistent estimator of f(x) at all x in the
neighborhood of which f(x) and f'(x) are continuous and $f(x) \neq 0$.

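Before the theorem is proven, the following lemma is developed.

Lemma 2: The density estimate defined in Theorem 2 follows
asymptotically a Gaussian distribution.

Proof of Lemma 2: First, $1/\hat{f}_n(x)$ will be shown to be asymptotically
Gaussian. If s quantiles $\xi_{p_1}, \xi_{p_2}, \ldots, \xi_{p_s}$ are estimated
by equation (IV.19) and f(x) is differentiable in the neighborhood of
each $\xi_{p_i}$, then the s quantile estimates
$\hat{\xi}_{p_1}, \hat{\xi}_{p_2}, \ldots, \hat{\xi}_{p_s}$ follow asymptotically
an s-variate Gaussian distribution [12] with means

$$E_G(\hat{\xi}_{p_i}) = \xi_{p_i} \tag{IV.24}$$

variances

$$\mathrm{var}_G(\hat{\xi}_{p_i}) = \frac{p_i(1 - p_i)}{n[f(\xi_{p_i})]^2} \tag{IV.25}$$

and covariances

$$\mathrm{cov}_G(\hat{\xi}_{p_i}, \hat{\xi}_{p_j}) = \frac{p_i(1 - p_j)}{n f(\xi_{p_i}) f(\xi_{p_j})}, \quad i < j \tag{IV.26}$$

[Footnote to Theorem 2: Let $X_1, X_2, \ldots, X_n$ be n independent random
variables identically distributed as a random variable X with distribution
function F(x). $\hat{\theta}(X_1, X_2, \ldots, X_n)$ is a consistent estimator
of $\theta$ if $\hat{\theta}(X_1, X_2, \ldots, X_n)$ converges to $\theta$ as
$n \to \infty$. Convergence in this report is shown in probability.]

A subscript G has been used to indicate that these are the means
and variances of the asymptotic Gaussian distribution since the
means and variances of the asymptotic distribution of a random
variable are not necessarily equal to the limits of the actual
means and variances of the variable. Letting $s = 2$,
$p_1 = \frac{j}{k(n)+1}$, and $p_2 = \frac{j+1}{k(n)+1}$,
$(k(n)+1)(\hat{\xi}_{\frac{j+1}{k(n)+1}} - \hat{\xi}_{\frac{j}{k(n)+1}})$ is a
linear combination of two asymptotically Gaussian random variables and so
is itself asymptotically Gaussian with mean

$$E_G\bigl[(k(n)+1)(\hat{\xi}_{\frac{j+1}{k(n)+1}} - \hat{\xi}_{\frac{j}{k(n)+1}})\bigr] = (k(n)+1)(\xi_{\frac{j+1}{k(n)+1}} - \xi_{\frac{j}{k(n)+1}}) \tag{IV.27}$$

and variance

$$\begin{aligned} \mathrm{var}_G\bigl[(k(n)+1)(\hat{\xi}_{\frac{j+1}{k(n)+1}} - \hat{\xi}_{\frac{j}{k(n)+1}})\bigr] &= (k(n)+1)^2 \bigl[\mathrm{var}_G(\hat{\xi}_{\frac{j+1}{k(n)+1}}) - 2\,\mathrm{cov}_G(\hat{\xi}_{\frac{j}{k(n)+1}}, \hat{\xi}_{\frac{j+1}{k(n)+1}}) + \mathrm{var}_G(\hat{\xi}_{\frac{j}{k(n)+1}})\bigr] \\ &= \frac{(k(n)+1)^2}{n} \left\{ \frac{\frac{j+1}{k(n)+1}\bigl(1 - \frac{j+1}{k(n)+1}\bigr)}{[f(\xi_{\frac{j+1}{k(n)+1}})]^2} - 2\,\frac{\frac{j}{k(n)+1}\bigl(1 - \frac{j+1}{k(n)+1}\bigr)}{f(\xi_{\frac{j}{k(n)+1}})\, f(\xi_{\frac{j+1}{k(n)+1}})} + \frac{\frac{j}{k(n)+1}\bigl(1 - \frac{j}{k(n)+1}\bigr)}{[f(\xi_{\frac{j}{k(n)+1}})]^2} \right\} \end{aligned} \tag{IV.28}$$

These two equations are actually the asymptotic mean and variance
of $1/\hat{f}_n(x)$. Before finding the asymptotic mean and variance of
$\hat{f}_n(x)$, it will be shown that $\mathrm{var}_G(1/\hat{f}_n(x))$ tends to
0 as $n \to \infty$. This will be shown by expanding the terms in equation
(IV.28). By definition of quantiles,
$F(\xi_{\frac{j}{k(n)+1}}) = \frac{j}{k(n)+1}$ and
$F(\xi_{\frac{j+1}{k(n)+1}}) = \frac{j+1}{k(n)+1}$. For convenience, let
$h_1 = x - \xi_{\frac{j}{k(n)+1}}$ and $h_2 = \xi_{\frac{j+1}{k(n)+1}} - x$.
$F(x - h_1)$, $F(x + h_2)$, $1/f(x - h_1)$, and $1/f(x + h_2)$ can be
expanded to

$$F(x - h_1) = F(x) - h_1 f(x) + \frac{h_1^2}{2} f'(\alpha), \quad \xi_{\frac{j}{k(n)+1}} < \alpha < x \tag{IV.29}$$

$$F(x + h_2) = F(x) + h_2 f(x) + \frac{h_2^2}{2} f'(\phi), \quad x < \phi < \xi_{\frac{j+1}{k(n)+1}} \tag{IV.30}$$

$$\frac{1}{f(x - h_1)} = \frac{1}{f(x)} + h_1 \frac{f'(\gamma)}{[f(\gamma)]^2}, \quad \xi_{\frac{j}{k(n)+1}} < \gamma < x \tag{IV.31}$$

and

$$\frac{1}{f(x + h_2)} = \frac{1}{f(x)} - h_2 \frac{f'(\psi)}{[f(\psi)]^2}, \quad x < \psi < \xi_{\frac{j+1}{k(n)+1}} \tag{IV.32}$$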


After substituting the above four equations into the expression for
$\mathrm{var}_G(1/\hat{f}_n(x))$ in equation (IV.28) and performing algebraic
manipulations,

$$\mathrm{var}_G\!\left(\frac{1}{\hat{f}_n(x)}\right) = \frac{(k(n)+1)^2}{n} \left\{ \frac{h_1 + h_2}{f(x)} + O(h_1^2) + O(h_2^2) + O(h_1 h_2) \right\}. \tag{IV.33}$$

Now an expression for $(k(n)+1)$ will be found. Upon subtracting
equation (IV.29) from equation (IV.30),

$$F(x + h_2) - F(x - h_1) = f(x)(h_1 + h_2) + O(h_1^2) + O(h_2^2) \tag{IV.34}$$

Since $F(x + h_2) - F(x - h_1) = 1/(k(n)+1)$, it is found after algebraic
manipulations that

$$k(n) + 1 = \frac{1}{h_1 + h_2} \left\{ \frac{1}{f(x) + [O(h_1^2) + O(h_2^2)]/(h_1 + h_2)} \right\}. \tag{IV.35}$$

Substituting this expression into $\mathrm{var}_G[1/\hat{f}_n(x)]$ in equation
(IV.33),

$$\mathrm{var}_G\!\left[\frac{1}{\hat{f}_n(x)}\right] = \frac{k(n)+1}{n} \left\{ \frac{\dfrac{1}{f(x)} + \dfrac{O(h_1^2) + O(h_2^2) + O(h_1 h_2)}{h_1 + h_2}}{f(x) + \dfrac{O(h_1^2) + O(h_2^2)}{h_1 + h_2}} \right\} \tag{IV.36}$$

Equations (IV.15) and (IV.16) in the proof of Theorem 1 state that
$\xi_{\frac{j}{k(n)+1}} \to x$ and $\xi_{\frac{j+1}{k(n)+1}} \to x$ as
$k(n) \to \infty$, and so $h_1 \to 0$ and $h_2 \to 0$ as $k(n) \to \infty$.
Since $k(n)/n \to 0$ as $n \to \infty$,

$$\lim_{n \to \infty} \mathrm{var}_G[1/\hat{f}_n(x)] = 0 \tag{IV.37}$$


Now that the asymptotic distribution of $1/\hat{f}_n(x)$ has been
found and it has been shown that $\mathrm{var}_G[1/\hat{f}_n(x)] \to 0$ as
$n \to \infty$, the asymptotic distribution of $\hat{f}_n(x)$ will be obtained
by the following:

Lemma (David [12]): Let $X_1, X_2, \ldots, X_n$ be n independent random
variables identically distributed as a random variable X. Then
$t_j(X_1, X_2, \ldots, X_n)$, $j = 1, 2, \ldots, m$, are m random variables that
are functions of $(X_1, X_2, \ldots, X_n)$. If the random variables
$t_j(X_1, X_2, \ldots, X_n)$, $j = 1, 2, \ldots, m$, have asymptotically an
m-variate Gaussian distribution with means $\mu_j$, variances $\sigma_j^2$
which tend to 0 as $n \to \infty$, and covariances $\sigma_{ij}$, and if
$g_j(t_j)$ are single-valued functions with nonvanishing continuous
derivatives $g_j'(t_j)$ in the neighborhoods of $t_j = \mu_j$, then
$g_j(t_j)$ themselves have asymptotically an m-variate Gaussian distribution
with means $g_j(\mu_j)$ and covariances
$\sigma_{ij}\, g_i'(\mu_i)\, g_j'(\mu_j)$.

With $m = 1$,
$t = (k(n)+1)(\hat{\xi}_{\frac{j+1}{k(n)+1}} - \hat{\xi}_{\frac{j}{k(n)+1}})$, and
$\mu = (k(n)+1)(\xi_{\frac{j+1}{k(n)+1}} - \xi_{\frac{j}{k(n)+1}})$, and since
$f(x) \neq 0$, the transformation $g(t) = 1/t$ satisfies the conditions of
the lemma. Since $g'(t) = -1/t^2$, $\hat{f}_n(x)$ is asymptotically Gaussian
with mean

$$E_G[\hat{f}_n(x)] = \frac{1}{(k(n)+1)(\xi_{\frac{j+1}{k(n)+1}} - \xi_{\frac{j}{k(n)+1}})} \tag{IV.38}$$

and variance

$$\mathrm{var}_G[\hat{f}_n(x)] = \left[\frac{1}{(k(n)+1)(\xi_{\frac{j+1}{k(n)+1}} - \xi_{\frac{j}{k(n)+1}})}\right]^4 \mathrm{var}_G[1/\hat{f}_n(x)]. \tag{IV.39}$$

Theorem 1 states that

$$\lim_{n \to \infty} \frac{1}{(k(n)+1)(\xi_{\frac{j+1}{k(n)+1}} - \xi_{\frac{j}{k(n)+1}})} = f(x), \tag{IV.40}$$

and so

$$\lim_{n \to \infty} E_G[\hat{f}_n(x)] = f(x) \tag{IV.41}$$

Further, since $\mathrm{var}_G[1/\hat{f}_n(x)] \to 0$ as $n \to \infty$,

$$\lim_{n \to \infty} \mathrm{var}_G[\hat{f}_n(x)] = 0 \tag{IV.42}$$

Lemma 2 has been proven.


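Proof of Theorem 2: From Lemma 2, $\hat{f}_n(x)$ follows asymptotically the
Gaussian distribution $\Phi_n(u)$ with mean $E_G(\hat{f}_n(x))$ and variance
$\mathrm{var}_G[\hat{f}_n(x)]$,

$$\Phi_n(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\frac{u - E_G(\hat{f}_n(x))}{(\mathrm{var}_G[\hat{f}_n(x)])^{1/2}}} e^{-v^2/2}\,dv. \tag{IV.43}$$

From Lemma 2, $E_G(\hat{f}_n(x)) \to f(x)$ and
$\mathrm{var}_G[\hat{f}_n(x)] \to 0$ as $n \to \infty$, so

$$\lim_{n \to \infty} \frac{u - E_G(\hat{f}_n(x))}{(\mathrm{var}_G[\hat{f}_n(x)])^{1/2}} = \begin{cases} -\infty, & u < f(x) \\ 0, & u = f(x) \\ \infty, & u > f(x) \end{cases} \tag{IV.44}$$

Thus

$$\lim_{n \to \infty} \Phi_n(u) = \begin{cases} 0, & u < f(x) \\ \tfrac{1}{2}, & u = f(x) \\ 1, & u > f(x) \end{cases} \tag{IV.45}$$

Define

$$F(u) = \begin{cases} 0, & u < f(x) \\ 1, & u \ge f(x) \end{cases} \tag{IV.46}$$

The limit of $\Phi_n(u)$ as $n \to \infty$ equals F(u) at all points for which
F(u) is continuous. Since $\Phi_n(u)$ is degenerate at $u = f(x)$ in
the limit, $\hat{f}_n(x)$ converges in probability to f(x).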


IV.2 Restatement of Algorithm for Random Bin
Density Estimate

This section presents a concise summary of the procedure for
finding the random bin density estimate from n training samples.

1.) Calculations performed with the training samples:
a.) order the n training samples

$$y_1 < y_2 < \cdots < y_n \tag{IV.47}$$

b.) estimate k bin boundaries

$$\hat{\xi}_{\frac{j}{k+1}} = y_{\left[\frac{jn}{k+1}\right]+1}, \quad j = 1, 2, \ldots, k \tag{IV.48}$$

By storing the k bin boundaries, the entire density estimate is
stored so that f(x) can be estimated at a later time on line.


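2.) Calculations performed to find $\hat{f}(x)$ from the bin boundaries in
equation (IV.48):
a.) find $\hat{\xi}_{\frac{j}{k+1}}$ and $\hat{\xi}_{\frac{j+1}{k+1}}$ such that
$\hat{\xi}_{\frac{j}{k+1}} < x \le \hat{\xi}_{\frac{j+1}{k+1}}$
b.) then

$$\hat{f}(x) = \begin{cases} 0, & x < a \\[4pt] \dfrac{1}{k+1} \Big/ (\hat{\xi}_{\frac{1}{k+1}} - a), & a \le x \le \hat{\xi}_{\frac{1}{k+1}} \\[4pt] \dfrac{1}{k+1} \Big/ (\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}}), & \hat{\xi}_{\frac{j}{k+1}} < x \le \hat{\xi}_{\frac{j+1}{k+1}} \\[4pt] \dfrac{1}{k+1} \Big/ (b - \hat{\xi}_{\frac{k}{k+1}}), & \hat{\xi}_{\frac{k}{k+1}} < x \le b \\[4pt] 0, & x > b \end{cases} \tag{IV.49}$$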
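The two steps above can be sketched in a few lines. This is only an
illustrative sketch, not the implementation used for the experiments:
the function names are hypothetical, the samples are assumed distinct,
and the interval (a, b) is assumed to contain every training sample
strictly inside it so that no bin has zero width.

```python
from bisect import bisect_left

def fit_random_bins(samples, k):
    """Step 1: order the samples and keep the k boundary estimates of (IV.48)."""
    y = sorted(samples)
    n = len(y)
    # 0-based index (j*n)//(k+1) corresponds to the order statistic [jn/(k+1)]+1.
    return [y[(j * n) // (k + 1)] for j in range(1, k + 1)]

def random_bin_density(x, boundaries, a, b):
    """Step 2: evaluate the step-function estimate of (IV.49) at x."""
    if x < a or x > b:
        return 0.0
    edges = [a] + list(boundaries) + [b]
    # bisect_left returns the first edge >= x, so edges[i-1] < x <= edges[i];
    # the max() keeps x == a in the first bin.
    i = max(bisect_left(edges, x), 1)
    width = edges[i] - edges[i - 1]
    mass = 1.0 / (len(boundaries) + 1)  # each bin holds probability 1/(k+1)
    return mass / width

samples = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
boundaries = fit_random_bins(samples, 2)   # two boundaries, three bins
density_at_half = random_bin_density(0.5, boundaries, 0.0, 1.0)
```

Only the k boundaries and the end points a and b need to be kept on
line; the binary search makes each evaluation logarithmic in k.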




IV.3 Comparison of Random Bin Density Estimate with Other Estimates

Section III.1 stated that density estimates generally are of
the form $p(x \in \Delta)/\Delta$; either the denominator is specified
and the numerator estimated or the numerator is specified and the
denominator estimated. The random bin model is of the latter type
and so is similar to the NN estimate. Both estimate the interval
width from the training samples. The random bin model is similar
to the fixed bin model in that it is a step-function. In the random
bin and fixed bin models, the density function is estimated for all
x at once, and the total estimate is stored. Table IV.1 lists the
properties of the random bin model and the three models discussed in
Chapter III. Table IV.1 is similar to Table III.1 with the random
bin estimate added. The remainder of this chapter discusses the
estimates in more detail.

IV. 3.1 Storage and Computation Requirements of Density Estimates 

The storage and computation requirements of a density estimate 
can be divided into two parts. One part, to be called on-line, is 
for the storage of the data needed at the time of a classification 
decision and the amount of calculations required to make the decision. 
The other part, called off-line, is for any preprocessing that may be 
necessary before the data is stored for later use in a classification 
algorithm. 

As an example of how off-line and on-line storage and processing
might be utilized in practice, consider the EEG signals discussed in
Section I.2. A possible decision problem is to determine from EEG
measurements the state of consciousness of a patient undergoing
surgery. The information on the patient's state of consciousness
would determine the amount of anesthetic to give the patient.
Calculations on the EEG measurements to determine the density functions
necessary for such a decision could be performed off-line before the
surgery when large computer facilities would be available. During
the operation, the testing on the patient's state of consciousness
could be done on-line with small information storage facilities being
required.

                                        Factor 1
Factor 2                     Bin Width Specified            Bin Width Set by Training Samples
---------------------------  -----------------------------  ---------------------------------
Total Point Estimate (1)     Fixed Bin                      Random Bin
Single Point Estimate (2)    Parzen [16,18,21,22,23,24] (3) NN [17,18] (3)

Properties Influenced by Factor 2     Total Point Estimate    Single Point Estimate
------------------------------------  ----------------------  ---------------------
Is bin centered at x?                 no                      yes
Storage requirement                   small                   large
Computational complexity for any x    less                    more
Accuracy in deterministic sense       less                    more
Tail region problem                   yes                     no

Properties Influenced by Factor 1     Bin Width Specified             Bin Width Set by Training Samples
------------------------------------  ------------------------------  ---------------------------------
In f(x) = p(x in Delta)/Delta         denominator specified,          numerator specified,
                                      numerator estimated             denominator estimated
Difficulty of bin size choice (4)     more                            less
Convergence conditions as number      specified bin width -> 0 at     number of samples specified in
of training samples -> infinity       such a rate that number of      bin -> infinity at such a rate
                                      samples in bin -> infinity      that bin width -> 0 (5)

1. In Total Point Estimate, the density function is
estimated for all x at once, and the total estimate
is stored.

2. In Single Point Estimate, all training samples are
stored and the density is estimated separately for each x.

3. These numbers indicate references in the bibliography.

4. When the bin width is specified, there is a problem of
how to choose it initially so as to contain a number of
training samples that would give a reasonable estimate.
In letting the training samples set the bin width, a reasonable
estimate is more readily obtained.

5. The number of samples specified in the bin -> infinity but at a rate
sufficiently slower than the total number of training
samples -> infinity in order that the bin width that contains the
specified number of samples -> 0.

TABLE IV.1 Properties of Fixed Bin, Parzen, NN, and Random Bin Density Estimates

When the density estimate is a step-function calculated for all
x at once as in the random bin and fixed bin models, off-line processing
is necessary. But the on-line storage requirement of these
estimates is small as only the bin boundaries and step-function
values need be stored, and the on-line calculation of the density
estimate for any x is simple because only the bin in which x lies
need be found. The Parzen and NN models have no off-line processing.
But the on-line storage requirement is large since all training
samples are stored, and more on-line calculations are required as the
bin is centered at x every time an estimation is made.

As mentioned in the previous paragraph, the random and fixed bin
estimates require off-line storage. In the fixed bin model, the
fraction of training samples in each bin is calculated, and each
training sample may be discarded once the bin in which it lies has been
found. In the random bin model, the training samples are ordered
and all samples must be stored during the placement of the bin
boundaries. Thus the off-line processing requirement of the random
bin model is larger than that of the fixed bin model. The fixed
bin estimate is also easier to update with additional samples.

IV.3.2 Bin Placement

In the random bin and NN models, the interval positions are
determined by the training samples, while in the fixed bin and
Parzen models, the interval positions are specified before training
samples are known. When the intervals are specified beforehand,
a bin may contain a very high proportion of the samples; it may be
necessary to change the interval and estimate the density again to
increase the accuracy.

The centering of the bin at x in the NN and Parzen estimates
provides more deterministic accuracy. The random and fixed bin
models do not center their bins at x, but the decreased deterministic
accuracy is traded for smaller on-line storage and processing requirements.

Properties of the random and fixed bin models can be combined
into one estimate. The bin boundaries could be placed by some of the
training samples, then the bins could be taken as specified and the
fixed bin method applied to the other samples. Such a mixed estimate
would combine the two modes of density estimation, which are either
specifying the denominator of $p(x \in \Delta)/\Delta$ and estimating the
numerator or specifying the numerator and estimating the denominator.
The mixed density estimate would operate in each mode one at a time.

Sebestyen and Edie [27] have formulated a density estimate
that is one possible way of combining the two modes of density
estimation mentioned in the previous paragraph. Sebestyen and Edie
determine both the number of bins and bin sizes from the training
samples. The estimate is a step-function. First, an initial set
of bins is chosen. Then by applying the training samples, some
bins are enlarged and some reduced, and some new bins are created
and some old ones combined. The flat parts of the density function
are approximated by a few, large bins, and the rapidly varying
parts by more, smaller bins. The motivation of the estimate is
to minimize the mean square error

$$\int_{-\infty}^{\infty} (\hat{f}(x) - f(x))^2\,dx \tag{IV.50}$$

and require little storage.

Figures IV.4 a, b, and c show an illustrative comparison of
the estimates of Sebestyen and Edie, fixed bin, and random bin.

FIG. IV.4c Comparison of Density Estimates of Sebestyen
and Edie, Fixed Bin, and Random Bin

The Sebestyen and Edie method appears to come the closest to minimizing
the mean square error. But since the density function estimate is to
be used to classify observations, it seems more appropriate that the
estimate should have greater accuracy where observations are more
likely to occur. In other words, the estimate should be more accurate
nearer the peaks of the density. Rather than trying to minimize the
mean square error, a more appropriate criterion is to minimize

$$\int_{-\infty}^{\infty} (\hat{f}(x) - f(x))^2 f(x)\,dx \tag{IV.51}$$

Equation (IV.51) weighs more heavily the higher values of the
density function where more observations are likely to occur.

Since the random bin model places the bins so each bin contains
approximately the same number of training samples, more bins are
concentrated where more samples occur and the model comes closer
to satisfying equation (IV.51). It is of course possible to vary
the random bin model as presented in this thesis and to specify
different numbers of training samples for different bins.

IV.3.3 Tail Region Problem

A problem arises with step-function estimates when the random
variable X is distributed over the interval $(-\infty, \infty)$. If f(x) is
estimated for x less than the lowest bin boundary or greater than
the highest bin boundary, $\hat{f}(x)$ will be zero. For example in the
random bin model in equation (IV.21), $\hat{f}(x) = 0$ for
$x < \hat{\xi}_{\frac{1}{k+1}}$ or $x > \hat{\xi}_{\frac{k}{k+1}}$. Figure IV.5
illustrates this occurrence. If an estimate of f(x) is all that is desired
in the tail regions, then $\hat{f}(x) = 0$ is a reasonable estimate. A
problem occurs when $\hat{f}(x)$ becomes part of an estimated likelihood
ratio $\hat{f}(x|C^2)/\hat{f}(x|C^1)$ as is the case in the
estimated version of the Wald sequential probability ratio test
to be presented in the next chapter. When a string of t observations
has been taken and $x_t$ results in either $\hat{f}(x_t|C^1) = 0$ or
$\hat{f}(x_t|C^2) = 0$, the likelihood ratio of the t observations will be
zero or infinity, and will cause a decision to be made immediately
regardless of the previous observations. This phenomenon leads to more
erroneous decisions in the sequential test to be presented than should be
allowed by the specified error probabilities. The reason is that a
decision is made on the basis of only the one observation. The likelihood
ratio ignores previous observations, and the test does not evaluate enough
observations for the error rates to be small. This tail region problem, as
it will be called in this thesis, is discussed further in Chapter V
when the estimated version of the likelihood ratio is presented.

FIG. IV.5 Tail Regions of Random Bin Density Estimate

The Parzen and NN models avoid the tail region problem since
their density estimates are continuous in x.
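The effect of a single zero-valued density estimate on the likelihood
ratio can be illustrated with a toy example. The sketch below is
hypothetical throughout: it assumes independent observations so that the
ratio is a product, and the two step-function densities `f1` and `f2`
are invented stand-ins for estimated class densities, not estimates
taken from this work.

```python
def likelihood_ratio(observations, f1, f2):
    """Product-form ratio f2(x_1)...f2(x_t) / (f1(x_1)...f1(x_t))."""
    num = 1.0
    den = 1.0
    for x in observations:
        num *= f2(x)
        den *= f1(x)
    if den == 0.0:
        # A zero class-1 estimate drives the ratio to infinity.
        return float("inf") if num > 0.0 else float("nan")
    return num / den

def f1(x):
    """Toy class-1 step density: uniform on [0, 1], zero in the tails."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def f2(x):
    """Toy class-2 step density: uniform on [0.5, 2.5], zero in the tails."""
    return 0.5 if 0.5 <= x <= 2.5 else 0.0

history = [0.6, 0.7, 0.8]                              # all favor class 1
ratio_before = likelihood_ratio(history, f1, f2)
ratio_tail_1 = likelihood_ratio(history + [1.2], f1, f2)  # f1(1.2) = 0
ratio_tail_2 = likelihood_ratio(history + [0.2], f1, f2)  # f2(0.2) = 0
```

Three observations that all favor class 1 are overridden by one
observation in a tail of the class-1 estimate, which is the behavior
described above.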

IV.3.4 Conclusion to Comparison of Density Estimates

Table IV.1 is again recommended for a comparison of the various
estimates. The next chapter explores the use of the random bin
estimate in an estimated SPRT. The random bin model is chosen because
of its small on-line storage and processing requirements and its
placing of the interval widths by the training samples.

Appendix IV.1 - Discussion of Convergence Proofs of Density Estimates

This appendix discusses some factors involved in showing convergence
of density estimates. Parzen [16] shows convergence of his
estimate in the mean square sense. Loftsgaarden and Quesenberry [17]
show convergence in probability, and this report shows convergence
of the random bin density estimate in probability. Mean square
convergence is a stronger form of convergence, and in fact it implies
convergence in probability. The reason that convergence of the NN
and random bin models has been shown in probability appears to be
that their structure makes convergence harder to prove (it should be
noted that it has not been shown that they do not converge in the
mean square sense or with probability one).

The basic form of a density approximation is

$$\frac{p(\text{observation} \in \Delta)}{\Delta} \tag{IV.1.1}$$

As mentioned in Section III.2, a density estimate can either specify
the denominator and estimate the numerator or specify the numerator and
estimate the denominator. The Parzen model estimates the numerator,
and the NN and random bin models estimate the denominator. Because
of this, it is more difficult to find the means and variances of the
NN and random bin models. Estimating the denominator of equation
(IV.1.1) means estimating the interval width that contains a specified
fraction of the training samples. Distributions of order statistics
are involved, and it is difficult to calculate the variance of interval
width estimates from the densities of order statistics. Estimating
the numerator of equation (IV.1.1) involves estimating F(x+h) -
F(x-h), which has a variance that is easier to find.

To illustrate the factors discussed in the preceding paragraph,
some examples will be given of the type of calculations involved for
finding the variances of density estimates. Let the density estimates
be based on $X_1, X_2, \ldots, X_n$ where $X_1, X_2, \ldots, X_n$ are
independent random variables identically distributed as the random
variable X with absolutely continuous distribution function F(x) and
with probability density function f(x).

The first density estimate Parzen considers is

$$\hat{f}_{\text{Parzen}}(x) = \frac{S_n(x+h) - S_n(x-h)}{2h} \tag{IV.1.2}$$

where $S_n(x)$ is the fraction of samples less than x. The covariance
of $S_n(x)$ and $S_n(x')$ is [14]

$$\mathrm{cov}(S_n(x), S_n(x')) = \frac{1}{n}\bigl[F(\min(x, x')) - F(x)F(x')\bigr].$$
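Equation (IV.1.2) is simple to evaluate once the samples are sorted.
The following is a minimal sketch with a hypothetical function name; it
takes $S_n(t)$ literally as the fraction of samples strictly less than t
and leaves the choice of the half-width h to the caller.

```python
from bisect import bisect_left

def parzen_rect(x, samples, h):
    """Estimate of equation (IV.1.2): [S_n(x+h) - S_n(x-h)] / (2h)."""
    y = sorted(samples)
    n = len(y)
    # bisect_left counts the samples strictly less than its argument.
    s_hi = bisect_left(y, x + h) / n
    s_lo = bisect_left(y, x - h) / n
    return (s_hi - s_lo) / (2.0 * h)

# Three of five samples fall in [1.5, 4.5), so the estimate is (3/5)/3.
estimate = parzen_rect(3.0, [1, 2, 3, 4, 5], 1.5)
```

All training samples must be kept to evaluate the estimate at a new x,
which is the on-line storage cost noted for the single point estimates
in Table IV.1.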


For the general Parzen estimate

$$\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) \tag{IV.1.3}$$

Parzen shows that

$$\lim_{n \to \infty} nh\,\mathrm{var}[\hat{f}_n(x)] = f(x) \int_{-\infty}^{\infty} K^2(y)\,dy \tag{IV.1.4}$$

It is evident that the limit of the variance of the Parzen estimate can
be found and mean square convergence can be shown.
The NN density estimate is

(IV.1.5)

in which the interval width is set by the k-th nearest sample to x. The
NN estimate involves order statistic theory. In making calculations on
the NN estimate, the type of density function involved is that of the
k-th order statistic $y_k$, whose density is

$$\frac{n!}{(k-1)!\,(n-k)!}\,[F(y_k)]^{k-1}\,[1 - F(y_k)]^{n-k}\,f(y_k) \tag{IV.1.6}$$

The random bin estimate is

$$\hat{f}_{\text{Random bin}}(x) = \frac{1/(k+1)}{\hat{\xi}_{\frac{j+1}{k+1}} - \hat{\xi}_{\frac{j}{k+1}}} \quad \text{for } \hat{\xi}_{\frac{j}{k+1}} < x \le \hat{\xi}_{\frac{j+1}{k+1}} \tag{IV.1.7}$$

where $\hat{\xi}_p = y_{[np]+1}$ is the estimate of the p-th order quantile
$\xi_p$. The random bin estimate also involves order statistic theory,
and the type of density function used for making calculations on
the random bin estimate is that of the joint density of $\hat{\xi}_p$ and
$\hat{\xi}_q$, which is

$$\frac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}\,[F(\hat{\xi}_p)]^{i-1}\,[F(\hat{\xi}_q) - F(\hat{\xi}_p)]^{j-i-1}\,[1 - F(\hat{\xi}_q)]^{n-j}\,f(\hat{\xi}_p)\,f(\hat{\xi}_q) \tag{IV.1.8}$$

where $p < q$, $i = [np]+1$, and $j = [nq]+1$.

Since it is difficult to find explicit expressions for the
variance of random variables with density functions in equations
(IV.1.6) and (IV.1.8) and with F(x) and f(x) unknown, explicit
expressions for the variance of the NN and random bin estimates
are even more difficult to find. Thus the convergence of the
NN and random bin estimates has been shown in probability by
methods that do not involve finding explicit expressions for the
variance of the estimates, such as using asymptotic distributions.

ESTIMATED SPRT

Chapter IV developed a density function estimate with the intent
of utilizing it in a classification procedure. This chapter discusses
the Wald sequential probability ratio test (SPRT) and then forms an
estimated SPRT with the random bin density estimate. The SPRT has been
chosen since the decision problem involving the EEG responses
discussed in Section I.2 is particularly well suited for a sequential test.
Also a SPRT with density estimates presents some additional interesting
problems which occur only infrequently in tests that decide on the
basis of only one observation such as the Bayes decision rule. Some of
these problems that will be investigated in this report are estimating
densities in the tail regions and estimating densities of dependent
observations.


V.1 Review of SPRT

A well-known sequential test is the Wald SPRT [4,28,29]. In the
SPRT, the error probabilities are specified

$$\alpha = p(\text{error of type I}) = p(\text{decide } C^2 \mid C^1 \text{ true})$$
$$\beta = p(\text{error of type II}) = p(\text{decide } C^1 \mid C^2 \text{ true})$$   (V.1)

Define the likelihood ratio of t observations

$$L(x_1, x_2, \ldots, x_t) = \frac{f(x_1, x_2, \ldots, x_t \mid C^2)}{f(x_1, x_2, \ldots, x_t \mid C^1)}$$   (V.2)

and two thresholds

$$A = \frac{1-\beta}{\alpha}, \qquad B = \frac{\beta}{1-\alpha}$$   (V.3)

The operation of the SPRT is as follows.

1.) Take the first observation $x_1$. If

    $L(x_1) \le B$, decide $C^1$
    $B < L(x_1) < A$, observe the next observation $x_2$
    $L(x_1) \ge A$, decide $C^2$

2.) If another observation is taken, say the t-th observation $x_t$, and

    $L(x_1, x_2, \ldots, x_t) \le B$, decide $C^1$
    $B < L(x_1, x_2, \ldots, x_t) < A$, observe the next observation
    $L(x_1, x_2, \ldots, x_t) \ge A$, decide $C^2$

3.) Repeat step 2 on the next observation until a decision is made.

The SPRT takes new observations until the information contained in the
string of observations is sufficient that the probabilities of type I
and type II errors in making a decision are equal to the specified values
$\alpha$ and $\beta$ respectively. The SPRT has the property that among all tests
for which $\alpha$ and $\beta$ are specified, the SPRT requires the smallest number
of observations, on the average, to reach a decision [2,29].

When the observations $x_i$ are independent, the likelihood ratio can
be written as

$$L(x_1, x_2, \ldots, x_t) = \frac{f(x_1 \mid C^2)\, f(x_2 \mid C^2) \cdots f(x_t \mid C^2)}{f(x_1 \mid C^1)\, f(x_2 \mid C^1) \cdots f(x_t \mid C^1)}$$   (V.5)

For convenience, in the remainder of the report $f(x \mid C^1)$ is written
$f_1(x)$ and $f(x \mid C^2)$ is written $f_2(x)$.
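The three steps of the SPRT for independent observations can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from the thesis: the names `sprt` and `gaussian_pdf` are hypothetical, the densities are assumed known and strictly positive at every observation, and the logarithm of the likelihood ratio is accumulated so that the product in equation (V.5) becomes a running sum.

```python
import math


def sprt(observations, f1, f2, alpha, beta):
    """Wald SPRT on L = prod f2(x_i) / f1(x_i) for independent observations.

    Returns (decision, t): decision is 1 or 2, or None if the observations
    run out before either threshold is crossed; t is the number used.
    """
    log_a = math.log((1.0 - beta) / alpha)   # upper threshold A of (V.3)
    log_b = math.log(beta / (1.0 - alpha))   # lower threshold B of (V.3)
    log_l = 0.0
    t = 0
    for t, x in enumerate(observations, start=1):
        log_l += math.log(f2(x)) - math.log(f1(x))
        if log_l >= log_a:
            return 2, t
        if log_l <= log_b:
            return 1, t
    return None, t


def gaussian_pdf(mean, var=1.0):
    """Return a univariate Gaussian density with the given mean and variance."""
    def pdf(x):
        return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return pdf
```

For example, with unit-variance Gaussian densities of means -0.8 and +0.8 and $\alpha = \beta = 0.1$, each observation x adds 1.6x to the log likelihood ratio, so two observations equal to 1.0 are enough to decide class 2.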

The SPRT obtains the information contained in a string of observations
by evaluating the density functions of each class at the observation
values. Knowledge of the density functions of each class is required
for the SPRT, and so the test is not directly applicable to the case
where the only prior knowledge is that of training sets.

Fu [30] has developed a partially distribution-free version of
the SPRT that uses the training samples of only one class, say $C^1$. If
the samples from $C^1$ have an arbitrary distribution function F(x),
then the samples from $C^2$ are assumed to have the Lehmann alternative
distribution, which is $F^r(x)$, r > 0. After each observation from the
unknown class is taken, two sets of samples are formed: one from
samples of $C^1$ and the other by alternating samples of $C^1$ with observations
from the unknown class. The samples of both sets are ordered, and the
density functions of the two ordered sets are found. By assuming the
distributions of $C^1$ and $C^2$ are $F(x)$ and $F^r(x)$ respectively, the ratio
of the densities of the two orderings is independent of F(x). This new
ratio of densities is used in the SPRT to determine if the second ordering
contains only samples from $C^1$ or samples from both $C^1$ and $C^2$. Fu has
used training samples from only one class and has assumed the distribution
of $C^2$ is $F^r(x)$, r > 0, where F(x) is the distribution of $C^1$.

The method presented in this chapter uses training samples from
both classes and forms an estimated likelihood ratio for use in the SPRT
from estimates of the density functions of each class. The method is
distribution-free in that it does not require any knowledge of $f_1(x)$ and $f_2(x)$.

V.2 Random Bin Estimate in SPRT

V.2.1 Presentation of Random Bin Estimate in SPRT

Since the density functions $f_1(x)$ and $f_2(x)$ are unknown, they can
be estimated from training samples of each class, and an estimated
likelihood ratio can be formed

$$\hat L(x_1, x_2, \ldots, x_t) = \frac{\hat f_2(x_1)\, \hat f_2(x_2) \cdots \hat f_2(x_t)}{\hat f_1(x_1)\, \hat f_1(x_2) \cdots \hat f_1(x_t)}$$   (V.6)

Let

    $n_1$ be the number of training samples in class 1,
    $n_2$ be the number of training samples in class 2,
    $k_1$ be the number of quantiles for $\hat f_1(x)$,
    $k_2$ be the number of quantiles for $\hat f_2(x)$,
    $\hat\xi_{\frac{j}{k_1+1}}$, $j = 1, 2, \ldots, k_1$, be the $k_1$ quantiles for $\hat f_1(x)$, and
    $\hat\eta_{\frac{j}{k_2+1}}$, $j = 1, 2, \ldots, k_2$, be the $k_2$ quantiles for $\hat f_2(x)$.   (V.7)



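(V.10)

If $k_1 = k_2 = k$, the computation of $\hat L(x_1, x_2, \ldots, x_t)$ is reduced
since

$$\frac{\hat f_2(x_i)}{\hat f_1(x_i)} = \frac{\hat\xi_{\frac{j_1+1}{k_1+1}} - \hat\xi_{\frac{j_1}{k_1+1}}}{\hat\eta_{\frac{j_2+1}{k_2+1}} - \hat\eta_{\frac{j_2}{k_2+1}}}$$   (V.11)

where

$$\hat\xi_{\frac{j_1}{k_1+1}} < x_i \le \hat\xi_{\frac{j_1+1}{k_1+1}} \quad \text{and} \quad \hat\eta_{\frac{j_2}{k_2+1}} < x_i \le \hat\eta_{\frac{j_2+1}{k_2+1}}$$

Since $\hat f_i(x_1), \hat f_i(x_2), \ldots, \hat f_i(x_t)$ are estimated from the same
training samples, $\hat f_i(x_1), \hat f_i(x_2), \ldots, \hat f_i(x_t)$ are in general dependent, i = 1,2.
So

$$E[\hat f_i(x_1)\, \hat f_i(x_2) \cdots \hat f_i(x_t)] \ne E\hat f_i(x_1)\, E\hat f_i(x_2) \cdots E\hat f_i(x_t), \quad \text{for } i = 1,2$$   (V.12)

and $\hat L(x_1, x_2, \ldots, x_t)$ is a biased estimator of $L(x_1, x_2, \ldots, x_t)$.
But the next section shows that $\hat L(x_1, x_2, \ldots, x_t)$ converges in
probability to $L(x_1, x_2, \ldots, x_t)$ as $n_1 \to \infty$ and $n_2 \to \infty$ and is thus
a consistent estimate (see also conclusion to this chapter).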

V.2.2 Convergence of Likelihood Ratio

Theorem 3: Let $X^1_1, X^1_2, \ldots, X^1_{n_1}$ be a set of independent random
variables identically distributed as the random variable $X^1$ with absolutely
continuous distribution function $F_1(x)$ and with probability density
function $f_1(x)$, and let $X^2_1, X^2_2, \ldots, X^2_{n_2}$ be a set of independent random
variables distributed as $X^2$ with $F_2(x)$ and $f_2(x)$ similarly defined.
Let $\hat f_1(x)$ be an estimate of $f_1(x)$ and $\hat f_2(x)$ be an estimate of $f_2(x)$,
where the estimates are defined in Theorem 2 of Chapter IV. Define

$$\hat L(y_1, y_2, \ldots, y_t) = \frac{\hat f_2(y_1)\, \hat f_2(y_2) \cdots \hat f_2(y_t)}{\hat f_1(y_1)\, \hat f_1(y_2) \cdots \hat f_1(y_t)}$$

Then $\hat L(y_1, y_2, \ldots, y_t)$ converges in probability to

$$\frac{f_2(y_1)\, f_2(y_2) \cdots f_2(y_t)}{f_1(y_1)\, f_1(y_2) \cdots f_1(y_t)}$$

as $n_1 \to \infty$ and $n_2 \to \infty$ for all $y_1, y_2, \ldots, y_t$ in the neighborhood of
which $f_1(x)$, $f_1'(x)$, $f_2(x)$ and $f_2'(x)$ are continuous and $f_1(x) \ne 0$
and $f_2(x) \ne 0$.

Proof: From Theorem 2, $\hat f_1(y_i)$ converges in probability to $f_1(y_i)$
as $n_1 \to \infty$ and $\hat f_2(y_i)$ converges in probability to $f_2(y_i)$ as $n_2 \to \infty$.
The proof then follows directly from the theorem from
Krickeberg [31] that if the sequences of random variables $\xi_n, \eta_n, \ldots, \rho_n$
converge in probability to $\xi, \eta, \ldots, \rho$ then the sequence $g(\xi_n, \eta_n, \ldots, \rho_n)$
converges in probability to $g(\xi, \eta, \ldots, \rho)$ if g is a continuous function
and is finite.

This section has proposed an estimated SPRT where the likelihood
ratio is formed from random bin density estimates of each class. The
estimated likelihood ratio of independent observations was shown to
converge in probability to the true likelihood ratio. The remainder
of this chapter discusses the application of the SPRT to classification
problems.



V.3 Tail Region Estimation Problem in the Random Bin SPRT

One difficulty that occurs with a step-function density
estimate such as the random bin model is the estimation of the tail
regions of the density function. As an example of this problem,
consider two overlapping density functions as illustrated in Figure
V.1 with their possible estimates in Figure V.2. Assume that an
estimated SPRT is being performed and that after t observations no
decision has been made. Thus

$$B < \hat L(x_1, x_2, \ldots, x_t) < A$$

Suppose further that the observations to be classified belong to
class 1 and that the (t+1)-th observation is greater than $\hat\xi_{\frac{k}{k+1}}$.
This means that $\hat f_1(x_{t+1}) = 0$ and so

$$\hat L(x_1, x_2, \ldots, x_{t+1}) = \infty > A$$

A wrong decision that the observations belong to class 2 is made.
Since $\hat f_1(x) = 0$ for all $x > \hat\xi_{\frac{k}{k+1}}$, a decision of class 2 is
made for any $x_{t+1} > \hat\xi_{\frac{k}{k+1}}$. However, if the actual density functions
are known, it is possible that the ratio

$$\frac{f_2(x_{t+1})}{f_1(x_{t+1})} < \infty$$

and that the ratio $f_2(x_{t+1})/f_1(x_{t+1})$ is sufficiently small so

$$L(x_1, x_2, \ldots, x_{t+1}) < A$$

Thus it is possible that after the (t+1)-th observation where
$x_{t+1} > \hat\xi_{\frac{k}{k+1}}$, the estimated SPRT decides class 2 and the true
SPRT makes no decision. Estimating the tail regions of a density
function to be zero causes more classification errors than desired.
When a decision is based on only the one observation $x_{t+1}$,
the information contained in the previous t observations is
neglected. The same difficulty is encountered when classifying
observations from class 2 that are less than $\hat\eta_{\frac{1}{k+1}}$. Experimental
results appearing later in this chapter verify that the tail region
problem does result in more classification errors than would be
expected from the specified error probabilities. A step-function
estimate does not cause excessive classification errors on
observations between the tail regions since the likelihood ratio is not
zero or infinity. Consequently a decision is not automatically
made from the information supplied by the one observation.

The tail region problem occurs mainly when several observations 
are considered at once. If a classification process decides on the 
basis of only one observation, such as the Bayes decision rule, then 
estimating the tail regions to be zero may be acceptable. Since no 
information additional to the one observation is to be taken, no 
information is ignored by the likelihood ratio being zero or infinity. 

Two techniques for handling the tail region problem are discussed
in the next few sections. The methods either estimate the tail
regions differently or vary the SPRT.

V.3.1 Requiring Several Observations to Fall in the Tail Regions

One solution to the tail region problem that has worked
experimentally treats the observations from the tail regions separately
from the likelihood ratio. The method makes a decision of class 2
if r observations fall greater than $\hat\xi_{\frac{k}{k+1}}$ (refer to Figure V.2), and
a decision of class 1 if r observations fall less than $\hat\eta_{\frac{1}{k+1}}$. Only
observations between $\hat\eta_{\frac{1}{k+1}}$ and $\hat\xi_{\frac{k}{k+1}}$ are included in the likelihood
ratio. A decision about a string of observations is made in one of
two ways, either by the likelihood ratio of observations between
$\hat\eta_{\frac{1}{k+1}}$ and $\hat\xi_{\frac{k}{k+1}}$ falling outside the thresholds A and B, or by the
number of observations less than $\hat\eta_{\frac{1}{k+1}}$ equaling r or the number of
observations greater than $\hat\xi_{\frac{k}{k+1}}$ equaling r.

The motivation for this solution to the tail region problem
is that more observations are used in the decision process if r
observations rather than one are required to fall in each tail region
before deciding. With an increase in the required number of observations,
the decision is more likely to be made by the SPRT rather than the tail
region test, and the combined test is likely to be more accurate. The
error rate is decreased by increasing r, but the average number of
observations required for a decision is increased. If r is made very
large, the observations in the tail regions do not contribute at all
to the decision process.

A disadvantage of the technique presented in this section is
that the tail region treatment departs from the likelihood ratio
method of the SPRT. Since observations below $\hat\eta_{\frac{1}{k+1}}$ or above $\hat\xi_{\frac{k}{k+1}}$
are not included in the SPRT structure, the error probabilities of
the altered test may differ from those specified in a standard
SPRT. The next section presents a method that estimates the tail
regions with a different density estimate and preserves the SPRT
structure for all observations.
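The combined test of this section, an SPRT on the central observations plus a count of r observations in either tail, can be sketched as below. This is a sketch under assumptions rather than the thesis implementation: the function name and argument names are hypothetical, `eta_low` stands for the smallest bin boundary of the class 2 estimate and `xi_high` for the largest bin boundary of the class 1 estimate, `eta_low < xi_high` is assumed, and both density estimates are assumed positive between those two boundaries.

```python
import math


def sprt_with_tail_counts(observations, f1_hat, f2_hat, eta_low, xi_high,
                          alpha, beta, r):
    """Estimated SPRT in which tail observations are counted, not multiplied in.

    Returns (decision, t): decision is 1, 2, or None if no decision is reached.
    """
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    log_l = 0.0
    n_low = 0    # observations below eta_low (evidence for class 1)
    n_high = 0   # observations above xi_high (evidence for class 2)
    t = 0
    for t, x in enumerate(observations, start=1):
        if x > xi_high:
            n_high += 1
            if n_high == r:
                return 2, t
        elif x < eta_low:
            n_low += 1
            if n_low == r:
                return 1, t
        else:
            # Only central observations enter the likelihood ratio.
            log_l += math.log(f2_hat(x)) - math.log(f1_hat(x))
            if log_l >= log_a:
                return 2, t
            if log_l <= log_b:
                return 1, t
    return None, t
```

Setting r = 1 reproduces the behavior of a density estimate that is zero in the tails: a single observation beyond a boundary forces a decision regardless of the accumulated likelihood ratio.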

V.3.2 NN Tail Region Estimate 

Another way of handling the tail region problem is to employ
the nearest neighbor (NN) density estimate of Loftsgaarden and
Quesenberry explained in Section III.4.1. The NN estimate is

$$\hat f(x) = \frac{\ell(n) - 1}{n} \Big/ 2\,|x - y_{\ell(n)}|$$   (V.13)

where n is the number of training samples and $y_{\ell(n)}$ is the $\ell(n)$-th
nearest training sample to x according to the distance measure
$|x - y|$. This estimate is continuous in x and tends to zero only as
x approaches infinity. Disadvantages of the estimate are that all
training samples must be stored and the $\ell(n)$-th nearest sample to
x must be found for each x.

The NN estimate, however, can be used to advantage in the tail
regions. For any observation x below a certain value, the same training
sample is always the $\ell$-th nearest sample to x, and the same is true
for any x exceeding a certain value. Figure V.3 provides an illustration
with $\ell = 4$. Let $y_1 \le y_2 \le \cdots \le y_n$ be the ordered training samples.
For any x less than the midpoint between $y_1$ and $y_\ell$, $y_\ell$ is always
the $\ell$-th nearest training sample to x. So the NN density estimate
for any observation $x < \frac{y_1 + y_\ell}{2}$ is

$$\hat f(x) = \frac{\ell - 1}{n} \Big/ 2\,|x - y_\ell|$$   (V.14)

FIGURE V.3

Nearest Neighbor Density Estimate for Tail Regions
(For any $x < (y_1 + y_4)/2$, $y_4$ is the 4-th nearest sample. For any $x > (y_{n-3} + y_n)/2$, $y_{n-3}$ is the 4-th nearest sample.)

The estimate in equation (V.14) is greater than zero in the tail
regions. The values of $y_\ell$ and the midpoint of $y_1$ and $y_\ell$ are the
only information that needs to be stored for later use of the density
estimates of the tail regions. At the upper tail of the density,
$y_{n+1-\ell}$ is always the $\ell$-th nearest sample to any $x > \frac{y_{n+1-\ell} + y_n}{2}$.
So

$$\hat f(x) = \frac{\ell - 1}{n} \Big/ 2\,|x - y_{n+1-\ell}| \quad \text{for } x > \frac{y_{n+1-\ell} + y_n}{2}$$   (V.15)
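The tail formulas (V.14) and (V.15) need only the ordered training samples and the parameter of the NN estimate. The following Python sketch illustrates them under assumptions: the function name `nn_tail_density` is hypothetical, the samples are passed already sorted in increasing order, and `None` is returned for any x that lies in neither tail, where some other estimate would have to be used.

```python
def nn_tail_density(x, y_sorted, ell):
    """NN density estimate in the two tail regions, following (V.14) and (V.15).

    y_sorted: training samples in increasing order (y_1, ..., y_n).
    ell:      the neighbor index of the NN estimate.
    Returns None when x is not in either tail region.
    """
    n = len(y_sorted)
    lower_ref = y_sorted[ell - 1]   # y_ell in 1-indexed notation
    upper_ref = y_sorted[n - ell]   # y_{n+1-ell} in 1-indexed notation
    if x < (y_sorted[0] + lower_ref) / 2.0:
        return ((ell - 1) / n) / (2.0 * abs(x - lower_ref))
    if x > (upper_ref + y_sorted[-1]) / 2.0:
        return ((ell - 1) / n) / (2.0 * abs(x - upper_ref))
    return None
```

With ten samples 1, 2, ..., 10 and a neighbor index of 4, the lower tail begins below 2.5 and the upper tail above 8.5; at x = 0 the estimate is (3/10)/(2 · 4), which is positive, unlike a step-function estimate there.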


The random bin density estimate with the NN tail region estimate
is illustrated in Figure V.4. $\hat\xi_{\frac{1}{k+1}} = y_\ell$, where $\ell \approx \frac{n}{k+1}$ (see
equation (IV.20)), is the smallest bin boundary. For any $x < a =
(y_1 + \hat\xi_{\frac{1}{k+1}})/2$, $\hat\xi_{\frac{1}{k+1}}$ is the $\ell$-th nearest training sample to x.
Similarly $\hat\xi_{\frac{k}{k+1}} = y_{n+1-\ell}$, $\ell \approx \frac{n}{k+1}$, is the largest bin boundary,
and for $x > b = (\hat\xi_{\frac{k}{k+1}} + y_n)/2$, $\hat\xi_{\frac{k}{k+1}} = y_{n+1-\ell}$ is the $\ell$-th nearest
sample to x. The bins have been chosen so that each bin contains
approximately $\ell$ training samples. Referring to Figure V.4 again,
the density estimate still has not been determined for the regions
$A = \{x \mid a < x < \hat\xi_{\frac{1}{k+1}}\}$ and $B = \{x \mid \hat\xi_{\frac{k}{k+1}} < x < b\}$.

FIGURE V.4

Random Bin Density Estimate with NN Tail Region Estimate

The density in regions A and B is estimated for the experimental examples in
this report by centering a bin from the NN model at the midpoint
of the regions. $(a + \hat\xi_{\frac{1}{k+1}})/2$ is the midpoint of region A, and
$(\hat\xi_{\frac{k}{k+1}} + b)/2$ of region B. Thus with each bin containing $\ell$ samples,

$$\hat f_A(x) = \frac{\ell - 1}{n} \Big/ 2\,\left| \frac{a + \hat\xi_{\frac{1}{k+1}}}{2} - y_A \right| \quad \text{for } a < x < \hat\xi_{\frac{1}{k+1}}$$   (V.16)

where $y_A$ is the $\ell$-th nearest training sample to $(a + \hat\xi_{\frac{1}{k+1}})/2$, and

$$\hat f_B(x) = \frac{\ell - 1}{n} \Big/ 2\,\left| \frac{\hat\xi_{\frac{k}{k+1}} + b}{2} - y_B \right| \quad \text{for } \hat\xi_{\frac{k}{k+1}} < x < b$$   (V.17)

where $y_B$ is the $\ell$-th nearest training sample to $(\hat\xi_{\frac{k}{k+1}} + b)/2$.
The density has a constant value throughout each interval A and B.

The tail regions were estimated by the NN model in the manner explained
in order to assure that the bins in the tail regions contain approximately
the same number of samples as the bins in the center region which had
been estimated by the random bin model.

V.4 Experimental Results of the Estimated SPRT Tested on Gaussian Data

This section shows the results of the SPRT with the random bin
density estimate tested on independent, scalar Gaussian samples. The
mean of the distribution from class 1 is -0.8, and the mean of class 2
is +0.8. The variance of both classes is one. Experimentally, a
good relationship between the number of training samples n and the
number of quantiles k seems to be $k \approx n^{1/2}$. Loftsgaarden and
Quesenberry [17] also state that on the basis of some empirical
work using their estimate a value of $\ell$ near $n^{1/2}$ appears to give good
results. For the following examples, n = 999 training samples and k = 29
quantiles (giving k+1 = 30 bins) were used for each density estimate.
After the density functions of both classes were estimated, the
estimates were tested in the SPRT with one thousand test observations
from each class. The test was conducted with several values of the
error probabilities, $\alpha$ = p(decide class 2 | class 1 true) and $\beta$ = p(decide
class 1 | class 2 true). The next two sections present the experimental
results for the two tail region treatments discussed in Sections V.3.1
and V.3.2.
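As a worked illustration of the thresholds of equation (V.3), consider the case of equal error probabilities $\alpha = \beta$ at the three values 0.1, 0.01 and 0.001 that head the rows of Table V.1. The arithmetic below is only a sketch of how the thresholds follow from (V.3); it shows that with $\alpha = \beta$ the lower threshold is the reciprocal of the upper one.

```latex
\begin{aligned}
A &= \frac{1-\beta}{\alpha}, \qquad B = \frac{\beta}{1-\alpha},
  \qquad \alpha = \beta \;\Rightarrow\; B = \frac{1}{A} \\[4pt]
\alpha = \beta = 0.1:   &\quad A = \frac{0.9}{0.1} = 9,
  \qquad B = \frac{0.1}{0.9} \approx 0.111 \\
\alpha = \beta = 0.01:  &\quad A = \frac{0.99}{0.01} = 99,
  \qquad B = \frac{0.01}{0.99} \approx 0.0101 \\
\alpha = \beta = 0.001: &\quad A = \frac{0.999}{0.001} = 999,
  \qquad B = \frac{0.001}{0.999} \approx 0.001001
\end{aligned}
```

Each tenfold reduction in the specified error probabilities therefore widens the continuation region $B < L < A$ by roughly a factor of ten on each side, which is consistent with the larger average number of observations reported for the smaller error probabilities.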

V.4.1 Experimental Results of the Estimated SPRT With r
Observations Falling in the Tail Regions

Section V.3.1 discusses the treatment of the tail region where a
decision is made either by the SPRT applied to observations between
the tail regions or after r observations fall in one of the tail regions.
Table V.1 shows the experimental results. Values of r from one to
five were considered. The error rates in Table V.1 for r = 1 represent
neglecting the tail region problem and allowing $\hat f_1(x)$ and $\hat f_2(x)$ to be
zero for observations in the tail regions. It is observed that the
experimental error rates for r = 1 are indeed higher than the specified
$\alpha$ and $\beta$. The error rates are decreased by increasing r. More
observations on the average are taken before a decision for the increased
r. From Table V.1, a value of r = 3 appears to be adequate to bring
the experimental error rates down to the specified $\alpha$ and $\beta$, and
r = 4 certainly appears sufficient.

TABLE V.1

Gaussian - Estimated SPRT with r Observations Falling in Tail Regions

                        Experimental            Experimental average number
                        error rate              of observations for decision
alpha = beta     r      Class 1     Class 2     Class 1     Class 2

   .1            1      .084        .059        2.15        1.92
                 2      .044        .026        4.04        3.71
                 3      .064        .035        5.35        4.95
                 4      [missing]   .055        6.37        6.14
                 5      .058        .043        7.25        7.20

   .01           1      .080        .061        2.49        2.12
                 2      .015        .088        5.13        4.39
                 3      .0075       0.0         7.47        6.42
                 4      .019        0.0         4.45        8.20
                 5      0.0         0.0         11.11       10.0

   .001          1      .081        .062        2.53        2.15
                 2      .016        .013        5.38        4.46
                 3      0.0         .0067       8.06        6.76
                 4      0.0         0.0         10.31       9.01
                 5      0.0         0.0         12.82       10.88

r = number of observations in tail regions required for decision
n = 999 training samples in each class; k+1 = 30 bins
1000 test observations from each class

V.4.2 Experimental Results of the Estimated SPRT with NN
Tail Region Estimate

The random bin density estimate combined with an NN density
estimate in the tail regions is discussed in Section V.3.2. The
experimental results of the SPRT formed with this estimate are
shown in Table V.2. The parameter $\ell$ in the NN estimates (see
equations (V.14), (V.15), (V.16), and (V.17)) is set equal to 33,
which is approximately the number of samples in each interval of
the random bin model. The experimental error rates in Table V.2
are observed to be below the specified $\alpha$ and $\beta$.

TABLE V.2

Gaussian - Estimated SPRT with NN Tail Region Estimate

                 Experimental            Average number of observations
                 error rate              required for decision
alpha = beta     Class 1     Class 2     Class 1     Class 2

   .1            .033        .046        2.75        3.29
   .01           .005        .0062       5.0         6.21
   .001          0.0         0.0         7.14        9.09

n = 999 training samples in each class; k+1 = 30 bins
1000 test observations from each class

V.5 Conclusion to Chapter V

In comparing Sections V.3.1 and V.4.1 with Sections V.3.2
and V.4.2, the NN density estimate appears to be a more satisfactory
solution to the tail region problem. With the NN method, the structure
of the SPRT is preserved and the specified error probabilities $\alpha$ and
$\beta$ retain their meaning.

Section V.2.1 mentioned that the marginal density estimates
$\hat f(x_1), \hat f(x_2), \ldots, \hat f(x_t)$ that multiply together to form the joint density,

$$\hat f(x_1, x_2, \ldots, x_t) = \hat f(x_1)\, \hat f(x_2) \cdots \hat f(x_t),$$

are dependent since they are estimated from the same training samples.
Thus, $\hat L(x_1, x_2, \ldots, x_t)$ is a biased estimator of $L(x_1, x_2, \ldots, x_t)$,
although the bias tends to zero as $n \to \infty$. On inspecting
Table V.2, this dependence appears to have not adversely affected
the experimental error rates. The dependence is discussed further
in the next chapter. So far only scalar samples have been
considered, and the next chapter also discusses multidimensional
samples.



CHAPTER VI 


MULTIDIMENSIONAL SAMPLES AND DEPENDENT OBSERVATIONS 


This chapter discusses some techniques for handling
multidimensional samples and dependent observations in the estimated
SPRT. In considering multidimensional samples, the symbol s
denotes the total number of dimensions or features of a vector
sample, and the number of a particular feature is indicated by
a superscript, for example $x^i$ is the i-th feature of the sample

$$x = (x^1, x^2, \ldots, x^s)$$   (VI.1)


VI.1 Multidimensional SPRT

One method of classifying independent multidimensional observations
with the SPRT is simply to form the estimated likelihood ratio with
multivariate density estimates

$$\hat L(x_1, x_2, \ldots, x_t) = \frac{\hat f_2(x_1^1, x_1^2, \ldots, x_1^s)\, \hat f_2(x_2^1, x_2^2, \ldots, x_2^s) \cdots \hat f_2(x_t^1, x_t^2, \ldots, x_t^s)}{\hat f_1(x_1^1, x_1^2, \ldots, x_1^s)\, \hat f_1(x_2^1, x_2^2, \ldots, x_2^s) \cdots \hat f_1(x_t^1, x_t^2, \ldots, x_t^s)}$$   (VI.2)

But estimating the density of an s-dimensional random variable requires
a large number of training samples. As the dimension increases, more
bins are needed to maintain deterministic accuracy, and then more
training samples are needed to assure random accuracy by each bin
containing an adequate number of samples.

The approach used in this report for treating multidimensional
samples is to transform the vector samples into scalars such that the
new scalars are random variables whose univariate density functions
can be estimated. The estimation of the univariate densities of the
scalar transformed samples requires fewer training samples than
the estimation of the multivariate densities of the original vector
samples. While multivariate density estimates are not considered
in this thesis, Appendix VI.1 briefly discusses how the density
estimates mentioned in Chapters III and IV can be extended to the
multidimensional case.

VI.1.1 Linear Combination of Features

As mentioned in the previous section, if the multidimensional 
samples of each class are transformed into scalars, the simpler 
univariate density functions can be estimated with fewer training 
samples. The estimated SPRT can be formed with the ratio of the 
univariate density estimates of the transformed samples of each class. In 
essence, a new classification problem has been formulated involving only 
scalar samples where the two classes of scalar samples are the transformed 
original multidimensional samples of the two classes. 

Among the infinite variety of transformations that can be chosen, 
a transformation should be selected such that 

i) the transformed scalar possesses the various properties 
required for the estimation of its density function and SPRT as dis- 
cussed in Chapter IV, and 

ii) the transformed scalar samples of the two classes should be 
separated as much as possible in some sense. 

This section explores the use of a linear transformation

$$z = \gamma_1 x^1 + \gamma_2 x^2 + \cdots + \gamma_s x^s$$   (VI.3)

where $\gamma_i$, i = 1,2,...,s are weighting factors. A linear transformation
has been chosen because of the ease of finding such a transformation.
If $(X^1, X^2, \ldots, X^s)$ is an s-dimensional random variable of the continuous
type, then $Z = \gamma_1 X^1 + \gamma_2 X^2 + \cdots + \gamma_s X^s$ is a random variable of the
continuous type and Z satisfies all the required properties presented
in Chapter IV for the estimation of its density. The choice of
linear transformations to separate classes of training samples was
discussed in Section II.4.1. Section II.4.1 mentioned that many
algorithms have been developed for placing a separating hyperplane
between two classes of samples [1], and that the equation of such
a separating hyperplane can be used as a linear transformation to
reduce the multidimensional samples to scalars. The specified error
probabilities $\alpha$ and $\beta$ of an SPRT can still be met if densities of
scalar transformed samples are used instead of the original
multidimensional samples. The knowledge of the multidimensional density
estimates, however, would be expected to provide more decision making
information than knowledge of the density estimates of the transformed
samples. The information loss of transformed density estimates appears
as an increase in the average number of observations required for a
decision. Nevertheless, the advantages of scalar transformed samples
are fewer training samples needed to estimate the density and the
simpler calculations for a univariate density estimate.

VI.1.2 Discussion of EEG Data

Experimental testing of the SPRT with a linear combination of
features was performed on the same EEG data that was used for
experimental testing in Chapter II. The classification problem with EEG
is outlined in Section 1.2, and Appendix II.2 analyzes the EEG
data in detail. The classification problem is to decide whether an
arbitrary string of EEG responses was recorded from a subject where

class 1: no light is flashing (normal response)
or
class 2: a light is periodically flashing into the subject's
eyes (evoked response).

As mentioned in Chapter II, the length of responses between the
flashes is one hundred milliseconds, and each response is considered
to be an observation or sample. The waveforms measured from the
patient are continuous and were converted to vector samples by sampling
the amplitude every millisecond. The sampling resulted in a one
hundred dimensional vector. Since a dimension of one hundred was
quite large, five features out of the hundred were selected for the
classification process. The feature reduction scheme of Prabhu [ ]
(the feature reduction scheme is explained in Appendix II.1) was used
to select the five features which have the most classification
information according to a criterion that separates the sample means of the
two classes and minimizes the sample variance about the means. A linear
transformation was applied to the samples with the coefficients of a
separating hyperplane determined by the scheme of Prabhu.

The random bin density model was estimated for each class from
999 transformed training samples. The number of quantiles was k = 29.
An SPRT formed from the density estimates was tested on one thousand
transformed observations from each class. The next two sections
show the test results for the random bin SPRT with the two tail region
treatments discussed in Sections V.4.1 and V.4.2.
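The reduction of vector samples to scalars by the linear transformation (VI.3), which precedes the density estimation described above, can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, and the weights here are taken as the difference of the class sample means purely as a stand-in, since the separating hyperplane scheme of Prabhu used in the thesis is not reproduced in this section.

```python
def project(sample, weights):
    """Apply z = gamma_1 x^1 + ... + gamma_s x^s to one vector sample."""
    return sum(g * x for g, x in zip(weights, sample))


def mean_difference_weights(class1, class2):
    """Stand-in choice of weights: class 2 sample mean minus class 1 sample mean.

    This is NOT the hyperplane scheme used in the thesis; it is only a simple
    direction along which the two sets of training samples tend to separate.
    """
    s = len(class1[0])
    mean1 = [sum(v[i] for v in class1) / len(class1) for i in range(s)]
    mean2 = [sum(v[i] for v in class2) / len(class2) for i in range(s)]
    return [m2 - m1 for m1, m2 in zip(mean1, mean2)]


def project_all(samples, weights):
    """Transform a list of s-dimensional samples into a list of scalars."""
    return [project(v, weights) for v in samples]
```

Once both training sets have been projected with the same weights, the univariate density of each class can be estimated from the resulting scalars and the scalar SPRT applied to projected test observations.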

VI.1.3 Experimental Results of the Estimated SPRT with r
Observations Falling in the Tail Regions - EEG

Table VI.1 shows the EEG experimental results where a decision
is made either by r observations falling in a tail region or by the
SPRT applied to observations occurring between the tail regions.
Values of r from one to five are treated and three different specified
error probabilities $\alpha$ and $\beta$ considered. On inspecting Table VI.1,
it is seen that the experimental error rates are on the order of the
specified probabilities of error if r equals four or five. Comparing
Tables VI.1 and V.1, the error rates for the EEG samples are higher for
the same values of r than for the Gaussian samples. The EEG responses
as they occur serially in time are dependent, and so the independence
assumption is not met. Independence was assumed both for saying that
the joint density of several observations is equal to the product of
marginal densities and for estimating the marginal densities from
training samples. The dependence accounts for the higher error rates in
Table VI.1. Also the EEG signals are slightly nonstationary.

TABLE VI.1

EEG - Estimated SPRT with r Observations Falling in Tail Regions

                        Experimental            Experimental average number
                        error rate              of observations for decision
alpha = beta     r      Class 1     Class 2     Class 1     Class 2

   .1            1      .105        .045        2.09        2.08
                 2      .074        .047        4.1         3.91
                 3      .067        .053        5.62        5.32
                 4      .067        .062        6.75        6.25
                 5      .061        .068        7.58        6.85

   .01           1      .104        .043        2.36        2.39
                 2      .049        .029        4.95        4.81
                 3      .029        .028        7.46        7.05
                 4      .019        .018        9.61        9.26
                 5      0.0         [illegible] 11.9        11.1

   .001          1      .10         .034        2.41        2.44
                 2      .051        .020        5.01        4.95
                 3      .031        0.0         8.06        7.58
                 4      0.0         0.0         10.65       10.0
                 5      0.0         0.0         12.8        12.5

r = number of observations in tail regions required for decision
n = 999 training samples in each class; k+1 = 30 bins
1000 test observations from each class

VI.1.4 Experimental Results of the Estimated SPRT with
NN Tail Region Estimate - EEG

Table VI.2 shows experimental results for the SPRT with the
tail regions of the densities estimated with the NN model. The
parameter $\ell$ for the NN estimate (see equations (V.14), (V.15), (V.16),
and (V.17)) was set equal to 33 so each bin whether from the random
bin or NN models contained approximately the same number of training
samples. The experimental error rates in Table VI.2 are observed
to be higher than the specified $\alpha$ and $\beta$. As mentioned in the previous
section, the observations are dependent, and the independence assumption
is violated. The next section discusses a method of overcoming the
problem of dependence of observations.

VI. 2 Dependent Observations 

So far in this thesis the observations have been assumed to be 
independent so that the joint density of t observations f . . . ,x^) 
can be expressed by f (x^)f (x^) • ‘ *f (x^) . The method presented in this 
section treats dependent observations by using the density of the sum 
of t observations rather than the joint density of t observations. 

VI, 2.1 Using the Sum of Observations in the SPRT 

The method to be presented for testing correlated features is a 
variation of the approach of taking a linear combination of the features 
of multidimensional samples . In the usual SPRT, the likelihood ratio 


of t observations is 
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a = e 

Experimental Results 

Exp er imen t al 
error rate 

Average number 
of observations 
required for 
decision 

Class 1 

Class 2 

Class 1 

Class 2 

.1 

.136 

.0345 

2.83 

2.47 

.01 

.0698 

.0092 

5.81 

4,63 

.001 

.0517 

0.0 

8.62 

6.67 


n = 999 training samples in each class k+l«30 bins 

1000 test observations from each class 


EEC - 

Estimated SPRT with NN Tail Region Estimate 


TABLE VI, 2 
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1 
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* * 3 

••.Xj.) 


(Vl.-t) 


and if the observations are independent, the ratio can be writter as 


f2(Xi)f 2(x2> * ■ 
f3^(Xi)fi(x2)‘ * ’ffCx^) 


(VI. 5) 


If the observations are dependent, the two likelihood ratios are 
not equal, and the error rates of the dependent EEG samples in 
Table VI. 2 where the likelihood ratio in equation (VI. 5) is used 
are indeed higher than the specified error probabilities. Instead 
of the likelihood ratio of the joint densities of t observations, a 
possible likelihood ratio is that of the densities of the sum of t 
observations 


f 2 (xi+X 2+. . .+x^) 

f l(Xi+X2+. . 


(VI. 6) 


The sum of t observations $\sum_{i=1}^{t} x_i$ is a scalar, and thus the
estimate of this likelihood ratio involves estimating only univariate
density functions. The likelihood ratio in equation (VI.6) is exact even
if the observations are dependent. In essence, a new random variable
$\sum_{i=1}^{t} X_i$ has been defined. If the $X_i$, i=1,2,...,t, are random
variables of the continuous type, then $\sum_{i=1}^{t} X_i$ is a random
variable of the continuous type and satisfies the requirements presented
in Chapter IV for its density function to be estimated. A string of
observations can be classified by the SPRT formed with the likelihood
ratio of equation (VI.6). While the SPRT formed with the new likelihood
ratio can meet the specified error probabilities, the sum of t observations
contains less decision making information than the values of the
separate t observations. The loss of information results in a greater
average number of observations being required for the test to make a
decision. Thus the new test no longer has the property of the regular
SPRT that among all tests for which α and β are specified, it requires
the smallest number of observations to reach a decision on the average.

But using the likelihood ratio of the sums of observations provides
a test that is exact for dependent observations and that involves only
the densities of scalar samples.
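The test built on equation (VI.6) can be sketched as follows. This is only an illustration under stated assumptions: the thresholds are Wald's usual approximations (1-β)/α and β/(1-α), with α taken as the probability of deciding class 2 when class 1 is true; the names `sprt_on_sums`, `f1_sum`, `f2_sum` and `gaussian_sum_density` are hypothetical; and known Gaussian densities stand in for the estimated densities of the sums.

```python
import math

def sprt_on_sums(observations, f1_sum, f2_sum, alpha, beta):
    """Sequential test on the ratio f2(sum) / f1(sum) of equation (VI.6).

    f1_sum(t, s) and f2_sum(t, s) are caller-supplied (hypothetical) densities
    of the sum s of t observations under class 1 and class 2.
    Returns (decision, observations_used); decision is None when the
    observations run out before either threshold is crossed.
    """
    upper = (1.0 - beta) / alpha   # assumed Wald upper threshold
    lower = beta / (1.0 - alpha)   # assumed Wald lower threshold
    total = 0.0
    for t, x in enumerate(observations, start=1):
        total += x                 # only the running sum is carried forward
        ratio = f2_sum(t, total) / f1_sum(t, total)
        if ratio >= upper:
            return 2, t
        if ratio <= lower:
            return 1, t
    return None, len(observations)

def gaussian_sum_density(mean, var):
    """Density of the sum of t independent N(mean, var) observations."""
    def density(t, s):
        spread = t * var
        return math.exp(-(s - t * mean) ** 2 / (2.0 * spread)) / math.sqrt(2.0 * math.pi * spread)
    return density

f1 = gaussian_sum_density(0.0, 1.0)   # stand-in for the class 1 density of the sum
f2 = gaussian_sum_density(1.0, 1.0)   # stand-in for the class 2 density of the sum
decision, used = sprt_on_sums([1.2, 0.9, 1.5, 1.1, 0.8, 1.3], f1, f2, 0.1, 0.1)
```

In the estimated SPRT the two stand-in functions would be replaced by density estimates of the sums obtained from training samples.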

In discussing the likelihood ratio in Section V.2.1, the product
of estimated marginal densities was substituted for the estimated
joint densities since the observations are independent. But because
the marginal densities are estimated from the same training samples,
they are dependent and

$$E[\hat f(x_1)\, \hat f(x_2) \cdots \hat f(x_t)] \neq E\hat f(x_1)\, E\hat f(x_2) \cdots E\hat f(x_t)$$

(although equality does hold as the number of training samples
approaches infinity). The product of marginal density estimates was
used, however, since the estimation of the t-variate density
$f(x_1, x_2, \ldots, x_t)$ for large t requires a large number of training
samples. The estimated likelihood ratio of the sums of observations
avoids any problems associated with the dependence of marginal density
estimates.

VI.2.2 Practical Considerations in Using the Sum of Observations
in the Estimated SPRT


If the estimated SPRT is performed with the likelihood ratio
of the sum of observations, the density functions of the random
variables $\sum_{i=1}^{t} x_i$ need to be estimated,

$$\hat f(x_1),\ \hat f(x_1 + x_2),\ \ldots,\ \hat f\Big(\sum_{i=1}^{t} x_i\Big),\ \ldots$$

The random variables are scalars so the density estimation is
straightforward. But in an SPRT, the number of observations t
may become large, and the number of training samples needed to
estimate $f(\sum_{i=1}^{t} x_i)$ increases as t increases. To obtain m
different samples of $\sum_{i=1}^{t} x_i$ for the estimation of
$f(\sum_{i=1}^{t} x_i)$, mt samples of $x_i$ are required. For a finite
number of training samples, it is possible to accurately estimate
$f(\sum_{i=1}^{t} x_i)$ for only smaller values of t.

In the experimental results of the next section, the maximum number
of observations summed together is six so that an adequate number
of summed samples would be obtained from which to estimate the densities.
In a string of observations larger than six, the product of several
densities of sums is taken. For t observations, the ratio would be

$$\frac{\hat f_2\big(\sum_{i=1}^{6} x_i\big)\, \hat f_2\big(\sum_{i=7}^{12} x_i\big) \cdots \hat f_2\big(\sum_{i=[t/6]6+1}^{t} x_i\big)}{\hat f_1\big(\sum_{i=1}^{6} x_i\big)\, \hat f_1\big(\sum_{i=7}^{12} x_i\big) \cdots \hat f_1\big(\sum_{i=[t/6]6+1}^{t} x_i\big)} \tag{VI.7}$$

This ratio is of course equal to equation (VI.6) only if

$$\sum_{i=1}^{6} x_i,\ \sum_{i=7}^{12} x_i,\ \ldots,\ \sum_{i=[t/6]6+1}^{t} x_i$$

are independent. However, equation (VI.7) provides better results than
equation (VI.5) because for t observations equation (VI.7) assumes the
independence of [t/6]+1 random variables and equation (VI.5) that of
t variables. Also if $x_1, x_2, \ldots, x_{12}$ are dependent, the dependence
between $\sum_{i=1}^{6} x_i$ and $\sum_{i=7}^{12} x_i$ is less than that
between two consecutive $x_i$'s. When u is the maximum number of
observations in any sum, the general expression for the likelihood
ratio is

$$\frac{\hat f_2\big(\sum_{i=1}^{u} x_i\big)\, \hat f_2\big(\sum_{i=u+1}^{2u} x_i\big) \cdots \hat f_2\big(\sum_{i=[t/u]u+1}^{t} x_i\big)}{\hat f_1\big(\sum_{i=1}^{u} x_i\big)\, \hat f_1\big(\sum_{i=u+1}^{2u} x_i\big) \cdots \hat f_1\big(\sum_{i=[t/u]u+1}^{t} x_i\big)} \tag{VI.8}$$
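The blockwise ratio of equation (VI.8) can be sketched as below. The function names are hypothetical, and known Gaussian densities of sums are used only as stand-ins for the estimated densities; a real application would supply estimates built from training sums.

```python
import math

def gaussian_sum_density(mean, var):
    """Stand-in density of the sum of t independent N(mean, var) observations."""
    def density(t, s):
        spread = t * var
        return math.exp(-(s - t * mean) ** 2 / (2.0 * spread)) / math.sqrt(2.0 * math.pi * spread)
    return density

def block_sum_ratio(xs, u, f1_sum, f2_sum):
    """Likelihood ratio in the form of equation (VI.8).

    The observations are cut into consecutive blocks of at most u terms;
    each block contributes the ratio of the densities of its sum.
    """
    ratio = 1.0
    for start in range(0, len(xs), u):
        block = xs[start:start + u]      # the last block may hold fewer than u terms
        s = sum(block)
        ratio *= f2_sum(len(block), s) / f1_sum(len(block), s)
    return ratio

f1 = gaussian_sum_density(0.0, 1.0)
f2 = gaussian_sum_density(1.0, 1.0)
ratio = block_sum_ratio([1.0] * 8, 6, f1, f2)   # one block of six, one block of two
```

With u = 6 this reduces to the form of equation (VI.7). When t is a multiple of u the loop simply produces no trailing partial block.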


VI.2.3 Experimental Results of Using the Sum of Observations - EEG

Table VI.3 shows the experimental results of the estimated
SPRT formed with the ratio of estimated densities of sums of
observations. The EEG data discussed in Section VI.1.2 was used. The
maximum number of observations summed together is six, which means
that the densities of the sums of one, two, ..., and six observations
need be estimated,

$$\hat f(x_1),\ \hat f(x_1 + x_2),\ \ldots,\ \hat f\Big(\sum_{i=1}^{6} x_i\Big).$$

              Experimental              Average number of observations
              error rate                required for decision
α = β         Class 1      Class 2      Class 1      Class 2
.1            .0618        .0278        5.67         5.55
.01           0.0          0.0          16.4         13.9
.001          0.0          0.0          25.6         20.8

1476 training samples, k+1 = 15 bins
246 sums of 1,2,...,6 samples in each class
1000 test observations for each class

EEG - Estimated SPRT Using Sums of Observations in
Random Bin Density Model with NN Tail Region Estimates

TABLE VI.3

The total number of training samples used was 1476, and so the densities
were estimated from 246 groups of six training samples. (1476 was
the largest number of training samples available for experimentation
that was divisible by 6.) The densities were estimated by the random
bin model with fifteen bins combined with the NN model in the tail
regions.
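The grouping of training samples into sums can be sketched as follows. The text states only that 246 groups of six samples were used; taking the sum of the first t members of each group as the training sums for t < 6 is an assumption of this sketch, and the name `training_sums` is hypothetical.

```python
def training_sums(samples, max_terms):
    """Form non-overlapping groups of max_terms consecutive samples and return,
    for each t = 1..max_terms, the sums of the first t members of every group.
    Samples left over after the last complete group are not used."""
    groups = [samples[i:i + max_terms]
              for i in range(0, len(samples) - max_terms + 1, max_terms)]
    return {t: [sum(group[:t]) for group in groups]
            for t in range(1, max_terms + 1)}

# 1476 placeholder values stand in for one class of training samples.
sums = training_sums(list(range(1476)), 6)
groups_per_density = len(sums[6])   # number of sums available for each density
```

Each list `sums[t]` would then be passed to the univariate density estimator for the sum of t observations.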

The experimental error rates in Table VI.3 meet the specified
error probabilities. The error rates in Table VI.3 are lower than
those in Table VI.2, which shows the results of the product of
marginal density estimates, but Table VI.3 requires more observations
on the average for a decision. Increased accuracy has been gained
by using the sum of observations.

VI.3 Conclusion to Chapter VI

This chapter has discussed some ways of handling multidimensional
and dependent samples. For multidimensional samples, the samples are
reduced to scalars by a linear transformation; for correlated samples,
the likelihood ratio of the sums of observations is taken. The objective
of these procedures is to allow univariate densities to be estimated
rather than joint densities. Increased accuracy in the error rates
has been achieved, but the average number of observations necessary for
a decision has increased.

Appendix VI.1 - Multivariate Extensions of Density Estimates
Considered in Chapter III and Chapter IV

The presentation of multivariate density function models in
this appendix is brief and is intended only to indicate ways the
models are generalized to multidimensional samples. The discussion
is not detailed, and convergence conditions are not shown.

The approach in generalizing the marginal density estimates
to multidimensional samples is to extend the interval Δ in equation
(III.2), which is repeated here

Um A) . = f(K) . 

n'+oo 

to a multidimensional volume element.


Multidimensional Fixed Bin Estimate 

The extension of the fixed bin model (see Section III.3.1)
to the multidimensional case is straightforward. Instead of
specifying bins in one dimension, bins are constructed in s dimensions.
The multidimensional equivalent of the fixed bin model is

$$\hat f(x^1, x^2, \ldots, x^s) = \frac{\text{number of samples in bin } i}{\text{total number of samples}} \Big/ \big(\text{volume of } s\text{-dimensional bin } i\big) \tag{VI.1.1}$$
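A minimal sketch of equation (VI.1.1) follows. It assumes the bins form a regular grid of axis-aligned boxes laid out from a chosen origin; that layout, and the names `fixed_bin_density`, `origin` and `widths`, are illustrative choices rather than anything specified in the text.

```python
import math

def fixed_bin_density(x, samples, origin, widths):
    """Fraction of samples in the bin that holds x, divided by the bin volume.

    Bins are boxes of the given widths laid out from origin (an assumed grid).
    """
    def bin_index(point):
        return tuple(math.floor((p - o) / w)
                     for p, o, w in zip(point, origin, widths))

    target = bin_index(x)
    count = sum(1 for sample in samples if bin_index(sample) == target)
    volume = math.prod(widths)          # volume of one s-dimensional bin
    return count / len(samples) / volume

samples = [(0.1, 0.1), (0.4, 0.3), (0.7, 0.2), (1.5, 1.5)]
estimate = fixed_bin_density((0.2, 0.2), samples, (0.0, 0.0), (0.5, 0.5))
```

Two of the four samples fall in the bin containing the query point, so the estimate is (2/4) divided by the bin area 0.25.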

Multidimensional Parzen Estimate

The Parzen estimate (see Section III.3.2) can be generalized to
the multidimensional case by replacing the one dimensional interval by
a multidimensional volume element. To obtain the density estimate, the
fraction of training samples in an s-dimensional bin centered at x is
divided by the volume of the bin.

$$\hat f(x^1, x^2, \ldots, x^s) = \frac{\text{number of samples in bin centered at } x}{\text{total number of samples}} \Big/ \big(\text{volume of bin}\big) \tag{VI.1.2}$$

The general Parzen estimate in equation (III.5) is extended by using
kernels of s variables.
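Equation (VI.1.2) can be sketched with a box-shaped bin centered at the query point. The box shape and the names `parzen_box_density` and `widths` are assumptions made for illustration; a general kernel of s variables would replace the indicator of the box.

```python
import math

def parzen_box_density(x, samples, widths):
    """Fraction of samples inside the box centered at x, divided by its volume."""
    inside = sum(
        1 for sample in samples
        if all(abs(si - xi) <= w / 2.0 for si, xi, w in zip(sample, x, widths))
    )
    return inside / len(samples) / math.prod(widths)

samples = [(0.1, 0.1), (0.4, 0.3), (0.7, 0.2), (1.5, 1.5)]
estimate = parzen_box_density((0.25, 0.25), samples, (0.5, 0.5))
```

Unlike the fixed bin model, the bin moves with the point at which the density is evaluated.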


Multidimensional NN Estimate 

Loftsgaarden and Quesenberry [17] give the multidimensional
generalization of their estimate. Centered at x is an s-dimensional
hypersphere whose radius is the distance from x to the ℓ(n)-th
nearest sample measured by some metric d. The estimate in
equation (III.9) extends to

$$\hat f(x^1, x^2, \ldots, x^s) = \frac{\ell(n)-1}{n} \Big/ \big(\text{volume of hypersphere of radius } d(x, x_{\ell(n)})\big) = \frac{\ell(n)-1}{n} \Big/ \frac{2 \pi^{s/2}\, [d(x, x_{\ell(n)})]^s}{s\, \Gamma(s/2)} \tag{VI.1.3}$$

where $x_{\ell(n)}$ is the ℓ(n)-th nearest training sample to x.
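Equation (VI.1.3) can be sketched as below, assuming the Euclidean metric for d (the text leaves the metric open). The name `nn_density` and the argument `ell` (standing for ℓ(n)) are hypothetical.

```python
import math

def nn_density(x, samples, ell):
    """Nearest neighbor estimate: (ell - 1)/n divided by the volume of the
    s-dimensional hypersphere reaching the ell-th nearest sample."""
    s = len(x)
    n = len(samples)
    distances = sorted(math.dist(x, sample) for sample in samples)
    radius = distances[ell - 1]      # distance to the ell-th nearest sample
    volume = 2.0 * math.pi ** (s / 2.0) * radius ** s / (s * math.gamma(s / 2.0))
    return (ell - 1) / n / volume

samples = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0)]
estimate = nn_density((0.0, 0.0), samples, 2)
```

In two dimensions the volume term reduces to π times the squared radius. The sketch does not guard against a zero radius, which occurs when x coincides with ℓ(n) or more training samples.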

Multidimensional Random Bin Estimate

In extending the random bin estimate to the multidimensional
case, the objective is to cover the s-dimensional sample space with
s-dimensional bins while letting the boundaries of the bins be
determined by the training samples. The multidimensional estimate
is presented by considering a two dimensional example. Figures VI.1
and VI.2 can be consulted to provide visual illustrations. As shown
in the figures, the multidimensional estimate partitions the sample
space into volume elements where each element contains the same
percentage of training samples.

First, the sample space is partitioned into strips parallel to
the $x^2$-axis in such a way that each strip contains an equal fraction
of the training samples. See Figure VI.1. The n two dimensional
samples,

$$(x_1^1, x_1^2),\ (x_2^1, x_2^2),\ \ldots,\ (x_n^1, x_n^2) \tag{VI.1.4}$$

are ordered according to the values of the first features,

$$(x_{i_1}^1, x_{i_1}^2),\ (x_{i_2}^1, x_{i_2}^2),\ \ldots,\ (x_{i_n}^1, x_{i_n}^2) \tag{VI.1.5}$$

where

$$x_{i_1}^1 \le x_{i_2}^1 \le \cdots \le x_{i_n}^1 .$$

Such an ordering uses an ordering function $g_1(x^1, x^2) = x^1$. Let
the integer $k_1$ be the number of lines drawn to partition the $x^1$-axis.
Then $k_1$ of the first features in equation (VI.1.5) are selected and
labeled according to

$$\xi_{j/(k_1+1)} = x^1_{i_{jn/(k_1+1)}} \quad \text{for } j = 1, 2, \ldots, k_1 . \tag{VI.1.7}$$

So a set of $k_1$ first features is chosen,
$\{\xi_{1/(k_1+1)}, \xi_{2/(k_1+1)}, \ldots, \xi_{k_1/(k_1+1)}\}$.
Lines are drawn parallel to the $x^2$-axis through the $k_1$ samples whose
first features have the values specified in equation (VI.1.7). The
strips between the lines each contain approximately the same number
of training samples.

Figure VI.1 First Step in Bin Placement for Multivariate
Random Bin Density Estimate

Figure VI.2 Bin Placement for Multivariate Random Bin
Density Estimate

Each strip is now partitioned separately into $k_2+1$ parts by
drawing lines within each strip parallel to the $x^1$-axis as shown in
Figure VI.2. Each segment is to contain approximately the same
number of training samples. The partitioning procedure of each
strip is shown by considering one strip, say the p-th strip. Let
$n_p$ be the number of samples in the p-th strip. The fact that the
p-th strip is being considered is indicated by placing a superscript p
on the pairs of parentheses enclosing the samples in the p-th strip.

$$(x_1^1, x_1^2)^p,\ (x_2^1, x_2^2)^p,\ \ldots,\ (x_{n_p}^1, x_{n_p}^2)^p . \tag{VI.1.8}$$

The samples in the p-th strip are ordered according to the values
of the second features

$$(x_{i_1}^1, x_{i_1}^2)^p,\ (x_{i_2}^1, x_{i_2}^2)^p,\ \ldots,\ (x_{i_{n_p}}^1, x_{i_{n_p}}^2)^p \tag{VI.1.9}$$

where

$$x_{i_1}^2 \le x_{i_2}^2 \le \cdots \le x_{i_{n_p}}^2 .$$

The ordering function that has been used for this is
$g_2(x^1, x^2) = x^2$. Select $k_2$ of the second features from the set in
equation (VI.1.9) and relabel them according to

$$\eta_{j/(k_2+1)} = x^2_{i_{j n_p/(k_2+1)}} \quad \text{for } j = 1, 2, \ldots, k_2 . \tag{VI.1.10}$$

So a set of $k_2$ second features

$$\{\eta_{1/(k_2+1)},\ \eta_{2/(k_2+1)},\ \ldots,\ \eta_{k_2/(k_2+1)}\}$$

has been chosen from the samples in the p-th strip. Lines parallel
to the $x^1$-axis are drawn through the $k_2$ samples in the p-th strip
whose second features have the values given in equation (VI.1.10).
The lines extend only between the boundaries of the p-th strip as is
shown in Figure VI.2.

The other strips are also partitioned by the method explained in
the previous paragraph. The two dimensional sample space is now partitioned
into $(k_1+1)(k_2+1)$ parts as in Figure VI.2. The density estimate for
any observation $x = (x^1, x^2)$ is

$$\hat f(x^1, x^2) = \frac{1}{(k_1+1)(k_2+1)} \Big/ \Big[ \big(\xi_{(p+1)/(k_1+1)} - \xi_{p/(k_1+1)}\big) \big(\eta_{(q+1)/(k_2+1)} - \eta_{q/(k_2+1)}\big) \Big] \tag{VI.1.11}$$

where

$$\xi_{p/(k_1+1)} < x^1 \le \xi_{(p+1)/(k_1+1)} \quad \text{and} \quad \eta_{q/(k_2+1)} < x^2 \le \eta_{(q+1)/(k_2+1)} .$$

$\xi$ and $\eta$ are defined by equations (VI.1.7) and (VI.1.10).

The density estimate in equation (VI.1.11) has involved a
partitioning of the sample space with ordering functions. Ordering
functions other than $g_1(x^1, x^2) = x^1$ and $g_2(x^1, x^2) = x^2$ could be
chosen. The estimate can be extended to more than two dimensions
by repeating the procedure of partitioning the sample space for
the additional dimensions.
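The two-step partitioning can be sketched in two dimensions as follows. Several choices here are assumptions of the sketch and are not taken from the text: the exact order statistic used for each cut, closing the outermost bins at the extreme training samples, and returning zero outside the training range (the univariate experiments instead use NN estimates in the tail regions). All function names are hypothetical.

```python
import bisect

def equal_count_cuts(values, k):
    """Pick k cut points from the sorted values so that the k+1 intervals hold
    approximately equal numbers of values (assumes len(values) >= k + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(j * n) // (k + 1) - 1] for j in range(1, k + 1)]

def fit_random_bins(samples, k1, k2):
    """Cut the first feature into k1+1 strips, then cut the second feature
    separately inside each strip into k2+1 segments."""
    firsts = [s[0] for s in samples]
    x_edges = [min(firsts)] + equal_count_cuts(firsts, k1) + [max(firsts)]
    strips = []
    for p in range(k1 + 1):
        lo, hi = x_edges[p], x_edges[p + 1]
        seconds = [s[1] for s in samples
                   if lo < s[0] <= hi or (p == 0 and s[0] == lo)]
        strips.append([min(seconds)] + equal_count_cuts(seconds, k2) + [max(seconds)])
    return x_edges, strips

def locate(edges, v):
    """Index p with edges[p] < v <= edges[p+1]; None outside the edges."""
    if v < edges[0] or v > edges[-1]:
        return None
    return max(bisect.bisect_left(edges, v) - 1, 0)

def random_bin_density(x, x_edges, strips):
    """Equal probability mass per bin divided by the area of the bin holding x."""
    p = locate(x_edges, x[0])
    if p is None:
        return 0.0
    y_edges = strips[p]
    q = locate(y_edges, x[1])
    if q is None:
        return 0.0
    area = (x_edges[p + 1] - x_edges[p]) * (y_edges[q + 1] - y_edges[q])
    bins = len(strips) * (len(y_edges) - 1)      # (k1+1)(k2+1)
    return 1.0 / bins / area

samples = [(float(i), float((i * 7) % 16)) for i in range(16)]
x_edges, strips = fit_random_bins(samples, 1, 1)
estimate = random_bin_density((3.0, 2.0), x_edges, strips)
```

Tied feature values would give a bin of zero width, so the sketch assumes distinct values, as is the case with probability one for random variables of the continuous type.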

The approach to the multivariate random bin density estimate
explained in this appendix has a possible drawback. In the presentation
of the bivariate estimate, bin boundaries are first placed parallel
to one axis, and then each of these intervals is subdivided. This
method does not treat the samples symmetrically. Long, thin bins
may result where wider, shorter bins would be more desirable. By
using several different ordering functions during the partitioning,
it may be possible to modify the method to overcome this difficulty.

CONCLUSION

VII.1 Concluding Remarks

Two sequential, distribution-free pattern classification
procedures have been presented. Estimates of the probabilities of
misclassification have been given, and experimental results of
testing on Gaussian and EEG patterns agree with the estimated error
rates. An estimate of a probability density function has also been
proposed.

In the method based on order statistics, a set of thresholds
is determined from the training samples, and each observation in the
sequential test is compared to a different pair of thresholds
depending on the particular iteration. In the method based on the SPRT,
the likelihood ratio is estimated from the training samples. The
estimated likelihood ratio is then updated to include each new
observation and is compared to the same pair of thresholds
throughout the test.

The information carried from one iteration to the next in the
sequential test based on order statistics is that the previous
observations fell in the intervals between their respective thresholds
at each iteration. In the estimated version of the SPRT, the two
density functions are estimated at the values of the observations,
and so more precise information about the location of the observations
is carried from one iteration to the next. The estimated SPRT uses
local information of the training samples near each observation while
the order statistics method considers all training samples at once
to determine the thresholds.

When the number of training samples is limited, a smaller error
rate is experimentally easier to obtain with the estimated version
of the SPRT. As mentioned in the previous paragraph, the estimated
SPRT uses more precise information on the location of the observations.
The method based on order statistics determines the thresholds
directly from the training samples. If the specified probability of
misclassification at each iteration is small, the intervals outside
the thresholds will contain fewer training samples, and consequently
the accuracy of the estimated probability of a future observation
falling in these intervals is less. The specified error probabilities
may also be so small that the number of training samples that
are calculated to be contained outside the thresholds is less than
one. In the estimated SPRT, density functions are estimated from
training samples; the number of samples in each interval of the
step-function density estimate is a parameter of the density estimate
and is independent of the desired error rate. Each bin of the density
estimate can be required to contain several training samples, and
thereby the accuracy of the density estimate can be controlled.
Thus when the number of training samples is limited, the estimated
SPRT performs better at smaller error rates.

The estimated SPRT has fewer prior assumptions about the
pattern classes. Chapter II mentioned that in order to use the
order statistics method the pattern classes should have one region
of overlap such that when multidimensional samples are transformed
to scalars the new scalar samples of one class lie largely
below those of the other class. The order statistics method with
a linear transformation cannot solve decision problems where the
samples of one class are surrounded by those of the other class.
The estimated SPRT, which estimates density functions, does not have
this restriction. But the order statistics procedure is simple to
implement and is well suited to the case where the two classes
can be separated to a degree by a linear transformation.

The number of training samples would be expected to influence 
how small an error rate can be obtained and the accuracy of the 
predicted error rates. Arbitrarily small error rates would not 
be expected to be obtainable from a limited number of training 
samples due to inaccuracies in the estimation procedures. The 
experimental error rates presented in this thesis do agree with the 
predicted error probabilities. In fact for the estimated version of 
the SPRT, error rates as small as .1 percent were obtained with 
1000 training samples from each class. 

VII.2 Suggestions for Future Work

i) The approach taken in this report for treating multidimensional
samples was to reduce them to scalars by a linear transformation.
Linear transformations that separate the two pattern classes were
selected. A possible area for future work is to investigate the use
of nonlinear transformations. Improved separation of the two pattern
classes might be obtained with nonlinear transformations, and the
average number of observations taken for a decision would be expected
to decrease. Also, different transformations might be used in different
regions of the sample space.

ii) More efficient use of the observations taken in the sequential
test based on order statistics may be possible by comparing all the
observations taken up to each iteration with the latest pair of thresholds
instead of only comparing the most recent observation. The calculations
for the thresholds should be modified to take into account that all
previous observations are being compared to the thresholds at each
iteration since the estimated probabilities of taking the next
observation are now different. By comparing all observations, the
sequential test would be expected to make a decision after taking
fewer observations.

iii) Some improvement in the random bin density estimate
might be possible by developing an interpolation technique to smooth
the estimate so that it is continuous rather than a step-function.
Also, it may be possible to generate a continuous estimate of the
distribution function by an interpolation procedure and use it in the
sequential test based on order statistics. With a continuous
distribution function estimate, the thresholds could be placed more
precisely for the desired error rates rather than setting thresholds only
equal to the values of training samples.

iv) The density estimate proposed in this report is a step-function.
This means that the distribution function is approximated in each interval
by a linear curve. An improved density estimate might be obtained by
fitting a nonlinear curve in each interval. There is a set of m
sample values $\{x_i\}$, i=1,2,...,m, in each interval and a set of
estimated distribution function values for these samples
$\{\hat F(x_i)\}$, i=1,2,...,m. A nonlinear curve could be fitted to
these points, and the density function would of course be the derivative
of the curve. It should be kept in mind, however, that $\hat F(x)$ is
only an estimate of F(x), and no matter how sophisticated a curve is
fitted, there is an inaccuracy from the estimated function values. So
the improvement in a density estimate by fitting a nonlinear curve may
be limited by the accuracy of estimating F(x). But some improvement in
the estimation accuracy should be possible by using a nonlinear curve
since the deterministic approximation to the density function may
be better and hence the bin width may be wider. Thus the bin may
contain more training samples. The tradeoff remains between 1) increasing
the bin size to contain more samples and hence increasing the accuracy
of the estimation, and 2) decreasing the bin size to obtain a better
deterministic approximation to the density function; but it may be
possible to change the balance point.